Chapter 1


Summary 


The key insight underlying this thesis is that the right kind of interaction is the key to making
the intractable tractable. This work specifically investigates this insight in the context of
learning theory. While much of the learning theory literature has traditionally focused on protocols
that are either non-interactive or involve unrealistically strong forms of interaction, there have
recently been several exciting advances in the design and analysis of methods for realistic
interactive learning protocols.

Perhaps  one  of  the  most  interesting  of  these  is  active  learning.  In  active  learning,  a  learning 
algorithm  is  given  access  to  a  large  pool  of  unlabeled  examples,  and  is  allowed  to  sequentially 
request  their  labels  so  as  to  learn  how  to  accurately  predict  the  labels  of  new  examples.  This 
thesis  contains  a  number  of  interesting  advances  in  our  understanding  of  the  capabilities  of  active 
learning  methods.  Specifically,  I  summarize  the  main  contributions  below. 


1.1  Bayesian  Active  Learning 

While most of the recent advances in our understanding of active learning have focused on the
traditional PAC model (or noisy variants thereof), similar advances specific to the Bayesian
learning setting have largely been lacking. Specifically, suppose that in addition to the data
itself, the learner has access to a prior distribution for the target function, and we are
interested in achieving a guarantee of low expected error rate, where the expectation is over both
the draw of the data and the draw of the target concept from the given prior. This setting has been
studied in depth for the passive learning protocol, but aside from the well-known work on the
query-by-committee algorithm, little was known about this setting for the active learning protocol.
This lack of knowledge is particularly troubling in light of the fact that most of the active
learning methods used in practice have Bayesian interpretations, selecting their label requests
based on Bayesian notions such as label entropy, expected error reduction, or reduction in the
total probability mass of the version space.

1.1.1  Arbitrary  Binary-Valued  Queries 

In this thesis, we present work that makes progress in understanding the Bayesian active learning
setting. To begin, we study the most basic question: how many queries are necessary if we
are able to ask arbitrary binary-valued queries? While label requests are only a special type of
binary-valued query, a general lower bound for arbitrary binary-valued queries will also hold for
label request queries, and thus provides a lower bound on the intrinsic query complexity of the
learning problem. Not surprisingly, we find that the number of binary-valued queries necessary
for learning is characterized by a kind of entropy quantity: namely, the entropy of the Voronoi
partition induced by a maximal $\epsilon$-packing.
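
As a rough illustration of the quantity named above, the following sketch computes the entropy of
the Voronoi partition induced by a greedily constructed maximal $\epsilon$-packing, for a small
finite hypothesis class. Everything concrete here is an assumption made for illustration only: the
finite domain, the function names, the use of "at least $\epsilon$ apart" as the packing condition,
and the tie-breaking rule for Voronoi cells. The formal definitions used in the thesis may differ in
these details.

```python
import math

def distance(h, g, px):
    """Probability, under the marginal px, that hypotheses h and g disagree."""
    return sum(p for hv, gv, p in zip(h, g, px) if hv != gv)

def maximal_packing(hyps, px, eps):
    """Greedy maximal eps-packing: indices of hypotheses pairwise at least eps apart."""
    centers = []
    for i, h in enumerate(hyps):
        if all(distance(h, hyps[c], px) >= eps for c in centers):
            centers.append(i)
    return centers

def voronoi_entropy(hyps, prior, px, eps):
    """Entropy (in bits) of the prior mass of the Voronoi cells of the packing."""
    centers = maximal_packing(hyps, px, eps)
    mass = {c: 0.0 for c in centers}
    for h, w in zip(hyps, prior):
        # Assumed tie-breaking: the earliest center among the nearest ones.
        nearest = min(centers, key=lambda c: distance(h, hyps[c], px))
        mass[nearest] += w
    return -sum(m * math.log2(m) for m in mass.values() if m > 0)

# Hypothetical example: threshold classifiers on four equally likely points,
# with a uniform prior over the five thresholds.
hyps = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1), (1, 1, 1, 1)]
px = [0.25, 0.25, 0.25, 0.25]
prior = [0.2] * 5
print(maximal_packing(hyps, px, 0.5))
print(round(voronoi_entropy(hyps, prior, px, 0.5), 3))
```

In this toy example the packing keeps three of the five thresholds, and the entropy is that of a
three-cell partition of the prior mass.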

1.1.2 Self-Verifying Active Learning

Our next contribution is a study of a special type of active learning, characterized by the
stopping criterion used in the learning algorithm. Specifically, consider a protocol in which the
input to the active learning algorithm is the desired error rate guarantee $\epsilon$, and the
algorithm makes a number of queries and then halts. For the algorithm to be considered “correct”,
it must have the guarantee that the expected error rate of the classifier it produces after halting
is at most the value of $\epsilon$ provided as input. We refer to this family of algorithms as
self-verifying. The label complexity of learning in this protocol is generally higher than in some
other protocols (e.g., budget-based), since the algorithm must not only find a classifier with good
error rate, but must also somehow be self-aware of the fact that it has found such a good
classifier. Indeed, it is known that prior-independent self-verifying algorithms may often have
label complexities no better than that of passive learning, which is $\Theta(1/\epsilon)$ for VC
classes. However, we prove that in Bayesian active learning, for any VC class and prior, there is a
prior-dependent method that always achieves an expected label complexity that is $o(1/\epsilon)$.
Thus, this represents a concrete result on the advantages of having access to the target’s prior
distribution.

1.2  Active  Testing 

One of the major challenges facing active learning is that of model selection. Specifically, given
a number of hypothesis classes, how does one decide which one to use? In passive learning, the
solution is simple: try them all, and then pick from among the resulting hypotheses using
cross-validation. But such solutions are not available to active learning, since the methods
tailored to each hypothesis class will generally make very different label requests, so that the
label complexity of producing a hypothesis from all of the classes is close to the sum of their
individual label complexities.

Thus, to avoid this problem, there is a need for procedures that quickly determine whether the
target concept is within (or approximated by) a given concept class, using far fewer label
requests than required for learning with that class: that is, for testing methods
that operate in the active learning protocol, which we therefore refer to as active testing. This
way, we can simply go through each class and test whether the target is in the class or not, and
only run the full learning method on some simplest class that passes the test. The question then
becomes how many fewer queries are required for testing compared to learning, as this quantifies
the savings from using this approach. Following the traditional literature on property testing,
the primary focus of such an analysis is on the dependence of the query complexity on the VC
dimension of the hypothesis class being tested. Since learning typically requires a number of
queries linear in the VC dimension, a sublinear dependence is considered an improvement, while
a query complexity independent of the VC dimension is considered superb.
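
The model selection procedure described above can be sketched as a short loop. This is only a
schematic under stated assumptions: `select_then_learn`, `passes_test`, and `learn` are hypothetical
names, the class names are placeholders, and the per-class label costs in the toy usage are made-up
numbers rather than results from this thesis.

```python
def select_then_learn(classes, passes_test, learn):
    """Return (class, hypothesis) for the first class, simplest first, that passes its test.

    classes      -- hypothesis classes ordered from simplest to most complex
    passes_test  -- hypothetical active tester: class -> bool, using few label requests
    learn        -- hypothetical active learner: class -> hypothesis
    """
    for cls in classes:
        if passes_test(cls):
            return cls, learn(cls)
    return None, None

# Toy usage with made-up per-class label costs.
labels_used = {"test": 0, "learn": 0}
test_cost = {"constants": 1, "thresholds": 2, "intervals": 3}
learn_cost = {"constants": 5, "thresholds": 20, "intervals": 80}
target_family = "thresholds"
order = ["constants", "thresholds", "intervals"]

def passes_test(cls):
    labels_used["test"] += test_cost[cls]
    # In this toy, a class passes exactly when it is at least as rich as the target's family.
    return order.index(cls) >= order.index(target_family)

def learn(cls):
    labels_used["learn"] += learn_cost[cls]
    return f"hypothesis from {cls}"

chosen, hypothesis = select_then_learn(order, passes_test, learn)
print(chosen, labels_used)
```

The point of the design is that only one full learning run is paid for; the remaining cost is the
sum of the (hopefully much smaller) testing costs of the classes tried before it.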

There is much existing literature on property testing. However, the standard model of property
testing makes use of membership queries, which are effectively label requests for feature
vectors of our own construction, rather than feature vectors from a given polynomial-sized
sample of unlabeled examples from the data distribution. Such methods are unrealistic for our model
selection purposes, since it is well-known in the machine learning community that the feature
vectors constructed by membership queries are often unintelligible to the human experts charged
with labeling the examples. However, the results from this literature on membership queries do
provide us a useful additional reference point, since we are certain that the query complexity of
active testing is no smaller than that of testing with membership queries, and no larger than that
of testing from random labeled examples (passive testing).

In  our  work  on  active  testing,  we  study  a  number  of  interesting  concept  classes,  and  find 
that  in  some  cases  the  query  complexity  is  nearly  the  same  as  that  of  testing  with  membership 
queries,  while  other  times  it  is  closer  to  that  of  passive  testing.  However,  in  most  (though  not  all) 
cases,  we  do  find  that  the  query  complexity  of  active  testing  is  significantly  smaller  than  that  of 
active learning, so that this approach to model selection can indeed be quite effective at reducing
the  total  query  complexity. 

1.3  Theory  of  Transfer  Learning 

Given the positive results mentioned above on the advantages of active learning with access to
the target’s prior distribution, the next natural question is, “How does one gain access to the
target’s prior distribution?” Traditionally, there have been a variety of answers to this question
given by the Bayesian Statistics community, ranging from subjective beliefs, to
computationally-motivated assumptions, to estimation. Perhaps one of the most appealing, from a
practical perspective, is the Empirical Bayes perspective, which says that we gain access to an
approximation of the prior based on analysis of past experience. In the learning context, this idea
of gaining insights for a new learning problem, based on experience with past learning problems,
goes by the name Transfer Learning. The specific model of transfer learning relevant to this
Empirical Bayes setting is the following. We suppose that we are tasked with a sequence of $T$
learning problems, or tasks. For each task, the unlabeled data are sampled i.i.d. according to some
distribution $D$, independently across the tasks. Furthermore, for each task the target function is
sampled according to some prior distribution $\pi$, again independently across tasks. We then
approach each task as usual, making a number of label requests and then halting with guaranteed
expected error rate at most $\epsilon$. The hope is that, after solving a number of learning
problems $t < T$, the label complexity of solving task $t + 1$ should be smaller than that of
solving the first task, due to gaining some information about the distribution $\pi$.

The  challenge  in  this  problem  is  that  we  do  not  get  direct  observations  of  the  target  functions 
from  each  task.  Rather,  we  may  only  observe  a  small  number  of  labeled  examples.  So  the 
question is how to extract useful information about $\pi$ from these limited observations. This
situation  is  further  complicated  by  the  fact  that  we  are  interested  in  minimizing  the  number  of 
samples  per-task,  and  that  the  active  learning  method's  queries  might  be  highly  task-specific. 
Indeed,  in  many  transfer  learning  settings,  each  task  is  approached  by  a  different  agent,  who  may 
be  non-altruistic  with  respect  to  the  other  agents;  thus,  she  may  be  unwilling  to  make  very  many 
additional  label  requests  merely  to  aid  the  learners  that  will  solve  future  tasks. 

In our work, we show that it is possible to gain benefits from transfer learning, while limiting
the number of additional queries (other than those used directly for learning) required from each
task. Specifically, we use a number of extra queries per task equal to the VC dimension of the
concept class. Using these queries, we are able to consistently estimate $\pi$, assuming only that
it resides in a known totally bounded class of distributions. We are then able to use this
estimate in the context of a prior-dependent learning method to asymptotically achieve an average
label complexity equal to that of learning with direct knowledge of $\pi$. Thus, we have realized
the aforementioned benefits of having knowledge of the target’s prior, including the guaranteed
$o(1/\epsilon)$ expected label complexity for self-verifying active learning. We further show that
no method taking fewer samples per task than the VC dimension can match this guarantee at
this level of generality.

Interestingly, under smoothness conditions on $\pi$, we also provide explicit bounds on the rate
of convergence of our estimator to $\pi$, and we additionally derive lower bounds on the minimax
rate  of  convergence.  This  has  implications  for  non-asymptotic  guarantees  on  the  benefits  of 
transfer  learning. 

We  also  extend  these  results  to  real-valued  functions,  where  the  VC  dimension  is  replaced 
by  the  pseudo-dimension  of  the  function  class.  In  addition  to  transfer  learning,  we  also  find  that 
this  technique  for  estimating  a  prior  distribution  over  real-valued  functions  has  applications  to 
the  preference  elicitation  problem  in  a  certain  type  of  combinatorial  auction. 

1.4  Active  Learning  with  Drifting  Distributions  and  Targets 

In addition to the work on Bayesian active learning, I have also studied the setting of
active  learning  without  access  to  a  prior.  Work  in  this  area  is  presently  more  mature,  so  that 
there  are  known  methods  that  are  robust  to  noise,  and  have  well-understood  label  complexities. 
However,  all  of  the  previous  theoretical  work  on  active  learning  supposed  the  data  were  sampled 
i.i.d.  from  some  fixed  (though  unknown)  distribution.  But  many  realistic  applications  of  active 
learning  involve  distributions  that  change  over  time,  so  that  we  require  some  understanding  of 
how  active  learning  methods  behave  under  drifting  distributions. 

In my work on this topic, I study a model of distribution drift in which the conditional
distribution of label given features remains fixed (i.e., no target drift), while the marginal
distribution over the feature vectors can change arbitrarily within a given totally bounded family
of distributions from one observation to the next. I then analyze a stream-based active learning
setting, in which the learner is at each time required to make a prediction for the label of a new
example, and then decide whether to request the label or not. We are then interested in the
expected number of mistakes and number of label requests, as a function of how many data points
have been observed.
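
To make the stream-based protocol concrete, here is a minimal sketch for one-dimensional threshold
classifiers in the realizable case. It uses a simple disagreement-based request rule, which is an
assumption chosen for illustration and is not claimed to be the method analyzed in this thesis. The
function name `run_stream`, the target value, and the particular drifting Beta marginal are
likewise hypothetical.

```python
import random

def run_stream(xs, label_oracle):
    """Stream-based active learning of a threshold h_t(x) = 1 if x >= t, realizable case.

    Returns the list of predictions and the number of label requests.
    """
    lo, hi = 0.0, 1.0  # thresholds consistent with the labels seen so far lie in (lo, hi]
    predictions, requests = [], 0
    for x in xs:
        guess = (lo + hi) / 2           # one threshold consistent with all labels so far
        predictions.append(1 if x >= guess else 0)
        if lo < x < hi:                 # consistent thresholds disagree on x: request its label
            requests += 1
            if label_oracle(x) == 1:
                hi = x                  # target threshold is at most x
            else:
                lo = x                  # target threshold exceeds x
    return predictions, requests

# Hypothetical drifting marginal: the Beta shape parameter changes from one point to the next,
# while the labeling rule (the target threshold) stays fixed.
random.seed(0)
target = 0.37
xs = [random.betavariate(1 + (i % 3), 2) for i in range(2000)]
truth = [1 if x >= target else 0 for x in xs]
predictions, requests = run_stream(xs, lambda x: 1 if x >= target else 0)
mistakes = sum(p != y for p, y in zip(predictions, truth))
print(mistakes, requests)
```

In this sketch a mistake can only occur on a point whose label is requested, since outside the
region of disagreement every consistent threshold predicts the same (correct) label.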

Interestingly,  I  find  that  even  with  such  drifting  distributions,  it  is  still  possible  to  guarantee 
a  number  of  mistakes  on  par  with  fully-supervised  learning,  while  only  requesting  a  sublinear 
number  of  labels,  as  long  as  the  disagreement  coefficient  is  sublinear  in  the  reciprocal  of  its 
argument under all distributions in the given family. I prove this both in the realizable case
and under Tsybakov noise conditions. I further provide a more detailed analysis of the frequency
of  label  requests  and  mistakes,  as  a  function  of  the  Tsybakov  noise  parameters,  the  supremum  of 
the  disagreement  coefficient  over  the  given  family  of  distributions,  and  the  covering  numbers  of 
the  family  of  distributions.  To  complement  this,  I  also  provide  lower  bounds  on  the  number  of 
label  requests  required  of  any  active  learning  method  whose  number  of  mistakes  is  on  par  with 
the  optimal  performance  of  fully-supervised  learning. 

We  have  also  studied  the  related  problem  of  active  learning  with  a  drifting  target  concept,  in 
which  the  target  function  itself  changes  over  time.  In  this  setting,  the  distribution  over  unlabeled 
samples  remains  fixed,  while  the  function  providing  labels  changes  over  time  at  a  specified  rate. 
We  then  express  bounds  on  the  expected  number  of  mistakes  and  queries,  as  a  function  of  this 
rate  of  change  and  the  number  of  samples. 

In  any  learning  context,  the  problem  of  efficient  learning  in  the  presence  of  noise  is  a  constant 
challenge.  Toward  addressing  this  challenge,  we  have  proposed  an  active  learning  algorithm  that 
makes  use  of  a  convex  surrogate  loss  function,  in  place  of  the  0-1  loss,  while  still  providing 
guarantees  on  the  obtained  error  rate  (under  the  0-1  loss)  and  number  of  queries  made  in  the 
active  learning  context,  under  the  assumption  that  the  surrogate  loss  is  classification-calibrated, 
and the minimizer of the surrogate loss resides in the function class used by the algorithm.

1.5 Efficiently Learning DNF with Representation-Specific Queries


In  addition  to  the  basic  active  learning  protocol,  based  on  label  requests,  we  have  also  studied 
an  interesting  new  type  of  learning  protocol,  in  which  the  algorithm  is  allowed  queries  regarding 
specific aspects of the representation of the target function. This setting is motivated by
applications in which there are essentially sub-labels for the examples, which may be difficult for an
expert  to  explicitly  produce,  but  for  which  they  can  easily  recognize  commonality.  For  instance, 
in  fraud  detection,  we  may  be  able  to  ask  an  expert  whether  two  given  examples  of  fraudulent 
transactions  are  representative  of  the  same  type  of  fraud. 

To study this idea formally, we specifically look at the classic problem of efficiently
learning  a  DNF  formula.  Certain  variants  of  this  problem  are  known  to  be  NP-Hard  if  we  are 
permitted  only  labeled  data  (e.g.,  proper  learning),  and  there  are  no  known  efficient  methods  for 
the  general  problem  of  learning  DNF,  even  with  membership  queries.  In  fact,  under  the  uniform 
distribution,  there  are  no  such  general  results  known  even  for  the  problem  of  learning  monotone 
DNF  from  labeled  data  alone.  Thus,  there  is  a  real  need  for  new  ideas  to  approach  the  problem 
of  learning  DNF  if  the  class  of  DNF  functions  is  to  be  used  for  practical  applications. 

In  our  work,  we  suppose  access  to  a  polynomial-sized  sample  of  labeled  examples,  and  for 
any  pair  of  positive  examples  from  that  sample,  we  allow  queries  of  the  type,  “Do  these  two 
examples  satisfy  a  term  in  common  in  the  target  DNF?”  It  turns  out  that  the  problem  of  learning 
arbitrary  DNF  under  arbitrary  distributions  is  no  easier  with  this  type  of  query  than  with  labeled 
examples  alone.  However,  using  queries  of  this  type,  we  are  able  to  efficiently  learn  several 
interesting  sub-families  of  DNF,  including  solving  some  problems  known  to  be  NP-Hard  from 
labeled data alone (properly learning 2-term DNF). Additionally, under the uniform distribution,
we find many more interesting families of DNF that are efficiently learnable with queries of
this type, including the well-studied family of $O(\log(n))$-juntas, and any DNF for which each
variable appears in at most $O(\log(n))$ terms.

We further study several generalizations of this type of query. In particular, if we allow the
algorithm to ask “How many terms do these two examples satisfy in common in the target DNF?”
then we can significantly broaden the collection of subfamilies of DNF that are efficiently
learnable. In particular, $O(\log(n))$-juntas become efficiently learnable under arbitrary
distributions, as does the family of DNF with $O(\log(n))$ terms.
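
The two pairwise queries described above can be made concrete with a small sketch of the oracle
that answers them. The representation of a DNF as a list of dictionaries, the function names, and
the example target formula are all assumptions made for illustration; the sketch shows only what
the queries return, not any of the learning algorithms.

```python
def satisfies(term, x):
    """True if example x (a tuple of booleans) satisfies the conjunction `term`.

    A term maps a variable index to the value that variable must take.
    """
    return all(x[i] == required for i, required in term.items())

def common_terms(dnf, x, y):
    """Answer to: how many terms of the target DNF do x and y both satisfy?"""
    return sum(1 for term in dnf if satisfies(term, x) and satisfies(term, y))

def share_a_term(dnf, x, y):
    """Answer to: do x and y satisfy a term in common in the target DNF?"""
    return common_terms(dnf, x, y) > 0

# Hypothetical target: (x0 AND x1) OR (NOT x2 AND x3).
target = [{0: True, 1: True}, {2: False, 3: True}]
a = (True, True, False, True)    # satisfies both terms
b = (True, True, True, False)    # satisfies only the first term
c = (False, False, False, True)  # satisfies only the second term
print(share_a_term(target, a, b), share_a_term(target, b, c), common_terms(target, a, a))
```

All three examples are positive, yet `b` and `c` are positive for different reasons, which is
exactly the information the pairwise query reveals and a plain label does not.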

With  a  further  strengthening  to  allow  the  query  to  involve  an  arbitrary  number  of  examples, 
rather than just two, we find we can efficiently (properly) learn an arbitrary DNF under an
arbitrary distribution. This is also the case if we restrict to just two examples in the query, but we
allow  the  algorithm  to  construct  the  feature  vectors  for  those  two  examples,  rather  than  selecting 
them  from  a  polynomial-sized  sample. 

Overall, we feel this is an important topic, in that it makes real progress on the
practically-important problem of efficiently learning DNF, which has otherwise been essentially stagnant for
a  number  of  years. 

1.6  Online  Allocation  with  Economies  of  Scale 

In addition to all of the above work on computational learning theory, this dissertation also
includes work on allocation problems in which the cost of allocating each additional copy of a
good is decreasing in the number of copies already allocated. This model captures the natural
economies of scale that arise in many real-world contexts. In this context, we derive methods
capable of allocating goods to a set of customers in a unit-demand setting, while achieving
near-optimal cost guarantees. We study this problem both in an offline setting, in which all of
the customer valuation functions are known in advance, and also in a type of online setting, in
which the customers arrive one-at-a-time, so that we do not know in advance what their valuation
functions will be. In the online variant of the problem, working under the assumption that the
valuation functions are i.i.d. samples, we make use of generalization guarantees from statistical
learning theory, in combination with the algorithmic solutions to the offline problem, to obtain the
approximation guarantees.

Chapter 2


Active  Testing 


Abstract¹

One of the motivations for property testing of boolean functions is the idea that testing can
serve  as  a  preprocessing  step  before  learning.  However,  in  most  machine  learning  applications, 
the  ability  to  query  functions  at  arbitrary  points  in  the  input  space  is  considered  highly  unrealistic. 
Instead,  the  dominant  query  paradigm  in  applied  machine  learning,  called  active  learning,  is  one 
where  the  algorithm  may  ask  for  examples  to  be  labeled,  but  only  from  among  those  that  exist 
in  nature.  That  is,  the  algorithm  may  make  a  polynomial  number  of  draws  from  the  underlying 
distribution  D  and  then  query  for  labels,  but  only  of  points  in  its  sample.  In  this  work,  we  bring 
this  well-studied  model  in  learning  to  the  domain  of  testing,  calling  it  active  testing. 

We show that for a number of important properties, testing can still yield substantial benefits
in this setting. This includes testing unions of intervals, testing linear separators, and testing
various assumptions used in semi-supervised learning. For example, we show that testing unions
of $d$ intervals can be done with $O(1)$ label requests in our setting, whereas it is known to
require $\Omega(\sqrt{d})$ labeled examples for passive testing (where the algorithm must pay for
labels on every example drawn from $D$) and $\Omega(d)$ for learning. In fact, our results for
testing unions of intervals also yield improvements on prior work in both the membership query
model (where any point in the domain can be queried) and the passive testing model [Kearns and
Ron, 2000]. In the case of testing linear separators in $\mathbb{R}^n$, we show that both active and
passive testing can be done with $O(\sqrt{n})$ queries, substantially less than the $\Omega(n)$
needed for learning and also yielding a new upper bound for the passive testing model. We also show
a general combination result that any disjoint union of testable properties remains testable in the
active testing model, a feature that does not hold for passive testing.

¹ Joint work with Maria-Florina Balcan, Eric Blais, and Avrim Blum.

In  addition  to  these  specific  results,  we  also  develop  a  general  notion  of  the  testing  dimension 
of  a  given  property  with  respect  to  a  given  distribution.  We  show  this  dimension  characterizes 
(up  to  constant  factors)  the  intrinsic  number  of  label  requests  needed  to  test  that  property;  we  do 
this  for  both  the  active  and  passive  testing  models.  We  then  use  this  dimension  to  prove  a  number 
of  lower  bounds.  For  instance,  interestingly,  one  case  where  we  show  active  testing  does  not  help 
is for dictator functions, where we give $\Omega(\log n)$ lower bounds that match the upper bounds for
learning  this  class. 

Our  results  show  that  testing  can  be  a  powerful  tool  in  realistic  models  for  learning,  and 
further  that  active  testing  exhibits  an  interesting  and  rich  structure.  Our  work  in  addition  develops 
new  characterizations  of  common  function  classes  that  may  be  of  independent  interest. 

2.1  Introduction 

One of the motivations for property testing of boolean functions is the idea that testing can serve
as a preprocessing step before learning, to determine whether learning with a given hypothesis
class is worthwhile [Goldreich, Goldwasser, and Ron, 1998]. Indeed, query-efficient testers have
been designed for many common hypothesis classes in machine learning such as linear threshold
functions [Matulef, O’Donnell, Rubinfeld, and Servedio, 2009], unions of intervals [Kearns
and Ron, 2000], juntas [Blais, 2009, Fischer, Kindler, Ron, Safra, and Samorodnitsky, 2004],
DNFs [Diakonikolas, Lee, Matulef, Onak, Rubinfeld, Servedio, and Wan, 2007], and decision
trees [Diakonikolas, Lee, Matulef, Onak, Rubinfeld, Servedio, and Wan, 2007]. (See Ron’s
survey [Ron, 2008] for much more on the connection between learning and property testing.)

Most  property  testing  algorithms,  however,  rely  on  the  ability  to  query  functions  on  arbitrary 
points  -  an  assumption  that  is  unrealistic  in  most  machine  learning  applications.  For  example, 
in  classifying  documents  by  topic,  while  selecting  an  existing  document  on  the  web  and  asking 
a  user  “is  this  about  sports  or  business?”  may  make  perfect  sense,  taking  an  existing  sports 
document (represented in $\mathbb{R}^n$ as a vector of word-counts), corrupting a random fraction of the
entries,  and  asking  “is  this  still  about  sports?”  does  not.  Early  experiments  yielded  similar 
failures  for  membership-query  learning  algorithms  in  vision  applications  when  asking  human 
users  about  corrupted  images  [Baum  and  Lang,  1993].  As  a  result,  the  dominant  query  paradigm 
in  machine  learning  has  instead  been  the  model  of  active  learning  where  the  algorithm  may 
query  for  labels  of  examples  of  its  choosing,  but  only  among  those  that  exist  in  nature  [Balcan, 
Beygelzimer,  and  Langford,  2006,  Balcan,  Broder,  and  Zhang,  2007a,  Balcan,  Hanneke,  and 
Wortman,  2008,  Beygelzimer,  Dasgupta,  and  Langford,  2009,  Castro  and  Nowak,  2007,  Cohn, 
Atlas,  and  Ladner,  1994a,  Dasgupta,  2005,  Dasgupta,  Hsu,  and  Monteleoni,  2007b,  Hanneke, 
2007a, Seung, Opper, and Sompolinsky, 1992, Tong and Koller, 2001].

In this work, we bring this well-studied model in learning to the domain of testing. In
particular, we assume that as in active learning, our algorithm can make a polynomial number of
draws  of  unlabeled  examples  from  the  underlying  distribution  D  (these  unlabeled  examples  are 
viewed  as  cheap),  and  then  can  make  a  small  number  of  label  queries  but  only  over  the  unlabeled 
examples  drawn  (these  label  queries  are  viewed  as  expensive).  The  question  we  ask  is  whether 
testing  in  this  setting  is  sufficient  to  still  yield  significant  benefit  in  terms  of  label  requests  over 
the  number  of  labeled  examples  needed  for  learning. 
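
The access model just described can be pictured as an oracle that hands out a pool of unlabeled
draws for free and charges for each label, answering only for points in that pool. The sketch below
is a minimal illustration under assumptions: the class name `ActiveTestingOracle`, its parameters,
and the toy function being tested are all hypothetical, and no particular tester is implemented.

```python
import random

class ActiveTestingOracle:
    """Holds an unlabeled pool and answers label requests only for points in that pool."""

    def __init__(self, draw, f, num_unlabeled, label_budget):
        self.pool = [draw() for _ in range(num_unlabeled)]  # cheap unlabeled draws from D
        self._f = f                                         # hidden function being tested
        self.label_budget = label_budget
        self.labels_used = 0

    def label(self, index):
        """Return the label of pool[index]; each call counts as one expensive label request."""
        if self.labels_used >= self.label_budget:
            raise RuntimeError("label budget exhausted")
        self.labels_used += 1
        return self._f(self.pool[index])

random.seed(1)
oracle = ActiveTestingOracle(draw=random.random, f=lambda x: int(x >= 0.5),
                             num_unlabeled=1000, label_budget=3)
# A tester may inspect oracle.pool freely, then choose which few points to label.
closest = min(range(len(oracle.pool)), key=lambda i: abs(oracle.pool[i] - 0.5))
print(oracle.label(closest), oracle.labels_used)
```

Indexing labels by pool position, rather than by an arbitrary point, is what separates this model
from membership queries: the algorithm chooses among existing examples but cannot construct new ones.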

What we show is that for a number of interesting properties relevant to learning, this capability
indeed allows for a substantial reduction in the number of labels required. This includes
testing unions of intervals, testing linear separators, and testing various assumptions about the
separation of data used in semi-supervised learning. For example, we show that testing unions
of $d$ intervals can be done with $O(1)$ label requests in our setting, whereas it is known to
require $\Omega(\sqrt{d})$ labeled examples for passive testing (where the algorithm must pay for
labels on every example drawn from $D$) and $\Omega(d)$ for learning. In the case of testing linear
separators in $\mathbb{R}^n$, we show that both active and passive testing can be done with
$O(\sqrt{n})$ queries, substantially less than the $\Omega(n)$ needed for learning and also yielding
a new upper bound for the passive testing model. These results use a generalization of Arcones'
Theorem on the concentration of U-statistics. For the case of unions of intervals, our results even
improve on prior work in the membership query and passive models of testing [Kearns and Ron, 2000],
and are based on a characterization of this class in terms of noise sensitivity that may be of
independent interest. We also show that any disjoint union of testable properties remains testable
in the active testing model, allowing one to build testable properties out of simpler components;
this is a feature that does not hold for passive testing.

In  addition  to  the  above  results,  we  also  develop  a  general  notion  of  the  testing  dimension  of  a 
given  property  with  respect  to  a  given  distribution.  We  show  this  dimension  characterizes  (up  to 
constant  factors)  the  intrinsic  number  of  label  requests  needed  to  test  that  property;  we  do  this  for 
both  passive  and  active  testing  models.  We  then  make  use  of  this  notion  of  dimension  to  prove 
a  number  of  lower  bounds.  For  instance,  one  interesting  case  where  we  show  active  testing  does 
not help is for dictator functions, a classic property where membership queries can allow testing
with $O(1)$ label requests, but where we show active testing requires $\Omega(\log n)$ labels, matching
the bounds for learning.

Our  results  show  that  a  number  of  important  properties  for  learning  can  be  tested  with  a 
small  number  of  label  requests  in  a  realistic  model,  and  furthermore  that  active  testing  exhibits 
an  interesting  and  rich  structure.  We  further  point  out  that  unlike  the  case  of  passive  learning, 
there  are  no  known  strong  Structural  Risk  Minimization  bounds  for  active  learning,  which  makes 
the use of testing in this setting even more compelling.2 Our techniques are quite different from
those used in the active learning literature.

2 In passive learning, if one has a collection of algorithms or hypothesis classes to try, there is little advantage
asymptotically to being told which of these is best in advance, since one can simply apply all of them and use an
appropriate union bound. In contrast, this is much less clear for active learning algorithms that each might ask for
labels on different examples.


2.1.1 The Active Property Testing Model

Before discussing our results in more detail, let us first introduce the model of active testing. A
property $\mathcal{P}$ of boolean functions is simply a subset of all boolean functions. We will also refer
to properties as classes of functions. The distance of a function $f$ to the property $\mathcal{P}$ over a
distribution $D$ on the domain of the function is
$\mathrm{dist}_D(f, \mathcal{P}) := \min_{g \in \mathcal{P}} \Pr_{x \sim D}[f(x) \neq g(x)]$. A tester
for $\mathcal{P}$ is a randomized algorithm that must distinguish (with high probability) between functions
in $\mathcal{P}$ and functions that are far from $\mathcal{P}$. In the standard property testing model introduced by
Rubinfeld and Sudan [Rubinfeld and Sudan, 1996], a tester is allowed to query the value of the
function on any input in order to make this decision. We consider instead a model in which we
add restrictions to the possible queries:

Definition 2.1 (Property tester). An $s$-sample, $q$-query $\epsilon$-tester for $\mathcal{P}$ over the distribution $D$ is a
randomized algorithm $A$ that draws $s$ samples from $D$, sequentially queries for the value of $f$ on
$q$ of those samples, and then

1. Accepts w.p. at least $\frac{2}{3}$ when $f \in \mathcal{P}$, and

2. Rejects w.p. at least $\frac{2}{3}$ when $\mathrm{dist}_D(f, \mathcal{P}) > \epsilon$.

We will use the terms “label request” and “query” interchangeably. Definition 2.1 coincides
with the standard definition of property testing when the number of samples is unlimited and the
distribution’s support covers the entire domain. In the other extreme case where we fix $q = s$, our
definition then corresponds to the passive testing model, where the inputs queried by the tester
are sampled from the distribution. Finally, by setting $s$ to be polynomial in some appropriate
measure of the input domain, we obtain the active testing model that is the focus of this paper:

Definition 2.2 (Active tester). A randomized algorithm is a $q$-query active $\epsilon$-tester for
$\mathcal{P} \subseteq \{0,1\}^n \to \{0,1\}$ over $D$ if it is a $\mathrm{poly}(n)$-sample, $q$-query $\epsilon$-tester for $\mathcal{P}$ over $D$.

Remark 2.1. We emphasize that the name active tester is chosen to reflect the connection with
active learning. It is not meant to imply that this model of testing is somehow “more active” than
the standard property testing model.

In some cases, the domain of our functions is not $\{0,1\}^n$. In those cases, we require $s$ to be
polynomial in some other appropriate measure of complexity that we specify explicitly.

Note that in Definition 2.1, since we do not have direct membership query access (at arbitrary
points), our tester must accept w.p. at least $\frac{2}{3}$ when $f$ is such that $\mathrm{dist}_D(f, \mathcal{P}) = 0$, even if $f$ does
not satisfy $\mathcal{P}$ over the entire input space. This, in fact, is one crucial difference between our
model and the distribution-free testing model introduced by Halevy and Kushilevitz [Halevy and
Kushilevitz, 2007] and further studied in [Dolev and Ron, 2010, Glasner and Servedio, 2009,
Halevy and Kushilevitz, 2004, 2005]. In the distribution-free model, the tester can sample inputs
from some unknown distribution and can query the target function on any input of its choosing.
It must then distinguish between the case where $f \in \mathcal{P}$ and the case where $f$ is far from the
property over the distribution. Most testers in this model strongly rely on the ability to query any
input3 and, therefore, these algorithms are not valid active testers.

In fact, the case of dictator functions, functions $f : \{0,1\}^n \to \{0,1\}$ such that $f(x) = x_i$
for some $i \in [n]$, helps to illustrate the distinction between active testing and the standard
(membership query) testing model. The dictatorship property is testable with $O(1)$ membership
queries [Bellare, Goldreich, and Sudan, 1998, Parnas, Ron, and Samorodnitsky, 2003]. In
contrast, with active testing, the query complexity is the same as needed for learning:

Theorem 2.3. Active testing of dictatorships under the uniform distribution requires $\Omega(\log n)$
queries. This holds even for distinguishing dictators from random functions.

This result, which we prove in Section 2.5.1 as an application of the active testing dimension
defined in Section 2.5, points out that the constraints imposed by active testing present real
challenges. Nonetheless, we show that for a number of interesting properties we can indeed
perform active testing with substantially fewer queries than needed for learning or passive testing.
In some cases, we will even provide improved bounds for passive testing in the process as well.

3 Indeed, Halevy and Kushilevitz’s original motivation for introducing the model was to better model PAC learning
in the membership query model [Halevy and Kushilevitz, 2007].

2.1.2  Our  Results 

We  have  two  types  of  results.  Our  first  results,  on  the  testability  of  unions  of  intervals  and  linear 
threshold  functions,  show  that  it  is  indeed  possible  to  test  properties  of  interest  to  the  learning 
community  efficiently  in  the  active  model.  Our  next  results,  concerning  the  testing  of  disjoint 
unions  of  properties  and  a  new  notion  of  testing  dimension,  examine  the  active  testing  model 
from  a  more  abstract  point  of  view.  We  describe  these  results  and  some  of  their  applications 
below. 

Testing Unions of Intervals. The function $f : [0,1] \to \{0,1\}$ is a union of $d$ intervals if there
are at most $d$ non-overlapping intervals $(\ell_1, u_1), \dots, (\ell_d, u_d)$ such that $f(x) = 1$ iff $\ell_i < x < u_i$
for some $i \in [d]$. The VC dimension of this class is $2d$, so learning a union of $d$ intervals requires
at least $\Omega(d)$ queries. By contrast, we show that testing unions of $d$ intervals can be done with a
number of label requests that is independent of $d$, for any distribution $D$:

Theorem 2.4. Testing unions of $d$ intervals in the active testing model can be done using only
$O(1/\epsilon^3)$ queries. In the case of the uniform distribution, we further need only $O(\sqrt{d}/\epsilon^5)$
unlabeled examples.

We  note  that  Theorem  2.4  not  only  gives  the  first  result  for  testing  unions  of  intervals  in  the 
active  testing  model,  but  it  also  improves  on  the  previous  best  results  for  testing  this  class  in  the 
membership query and passive models. Previous testers used $O(1)$ queries in the membership
query model and $O(\sqrt{d})$ samples in the passive model, but applied only to a relaxed setting
in which only functions that were $\epsilon$-far from unions of $d' = d/\epsilon$ intervals had to be rejected
with high probability [Kearns and Ron, 2000]. Our tester immediately yields the same query
bound as a function of $d$ (active testing with $O(\sqrt{d})$ unlabeled examples directly implies passive
testing with $O(\sqrt{d})$ labeled examples) but rejects any function that is $\epsilon$-far from unions of $d' = d$
intervals. Note also that Kearns and Ron [Kearns and Ron, 2000] show that $\Omega(\sqrt{d})$ samples are
required to test unions of $d$ intervals in the passive model, and so our bound on the number of
unlabeled examples in Theorem 2.4 is optimal in terms of $d$.

The proof of Theorem 2.4 relies on a new noise sensitivity characterization of the class of
unions of $d$ intervals. That is, we show that all unions of $d$ intervals have low noise sensitivity,
while all functions that are far from this class have noticeably larger noise sensitivity. We then
introduce a tester that estimates the noise sensitivity of the input function. We describe these results
in Section 2.2.

Testing Linear Threshold Functions. We next study the problem of testing linear threshold
functions (or LTFs), namely the class of boolean functions $f : \mathbb{R}^n \to \{0,1\}$ of the form
$f(x) = \mathrm{sgn}(w_1 x_1 + \cdots + w_n x_n - \theta)$ where $w_1, \dots, w_n, \theta \in \mathbb{R}$. LTFs can be tested with $O(1)$ queries in
the membership query model [Matulef, O’Donnell, Rubinfeld, and Servedio, 2009]. While we
show this is not possible in the active testing model, we nonetheless show we can substantially
improve over the number of label requests needed for learning. In particular, learning LTFs
requires $\Omega(n)$ labeled examples, even over the Gaussian distribution [Long, 1995]. We show
that the query and sample complexity for testing LTFs is significantly better:

Theorem 2.5. We can efficiently test LTFs under the Gaussian distribution with $O(\sqrt{n})$ labeled
examples in both active and passive testing models. Furthermore, we have lower bounds of
$\Omega(n^{1/3})$ and $\Omega(\sqrt{n})$ on the number of labels needed for active and passive testing respectively.

The  proof  of  the  upper  bound  in  the  theorem  relies  on  a  recent  characterization  of  LTFs  by  the 
Hermite  weight  distribution  of  the  function  [Matulef,  O'Donnell,  Rubinfeld,  and  Servedio,  2009] 
as  well  as  a  new  concentration  of  measure  result  for  U-statistics.  The  proof  of  the  lower  bound 
involves  analyzing  the  distance  between  the  label  distribution  of  an  LTF  formed  by  a  Gaussian 
weight vector and the label distribution of a random noise function. See Section 2.3 for details.


Testing Disjoint Unions of Testable Properties. Given a collection of properties $\mathcal{P}_i$, a natural
way to combine them is via their disjoint union. E.g., perhaps our data falls into $N$ well-separated
regions, and while we suspect our data overall may not be linearly separable, we believe it may
be linearly separable (by a different separator) in each region. We show that if each individual
property $\mathcal{P}_i$ is testable (in this case, $\mathcal{P}_i$ is the LTF property) then their disjoint union $\mathcal{P}$ is testable
as well, with only a very small increase in the total number of queries. It is worth noting that this
property does not hold for passive testing. We present this result in Section 2.4, and use it inside
our testers for semi-supervised learning properties discussed below.

Testing Semi-Supervised Learning Assumptions. Two common assumptions considered in
semi-supervised learning [Chapelle, Schölkopf, and Zien, 2006] and active learning [Dasgupta,
2011] are (a) if data happens to cluster then points in the same cluster should have the same label,
and (b) there should be some large margin $\gamma$ of separation between the positive and negative
region (but without assuming the target is necessarily a linear threshold function). Here, we
show that for both properties, active testing can be done with $O(1)$ label requests, even though
these classes contain functions of high complexity so learning (even semi-supervised or active)
requires substantially more labeled examples. Our results for the margin assumption use the
cluster tester as a subroutine, along with analysis of an appropriate weighted graph defined over
the data. We present our results in Section 2.4 but, for space reasons, defer analysis to Appendix
2.11.

General  Testing  Dimensions.  We  develop  a  general  notion  of  the  testing  dimension  of  a  given 
property  with  respect  to  a  given  distribution.  We  do  this  for  both  passive  and  active  testing 
models. We show these dimensions characterize (up to constant factors) the intrinsic number of
label requests needed to test the given property with respect to the given distribution in the
corresponding model. For the case of active testing we also provide a simpler notion that characterizes
whether testing with $O(1)$ label requests is possible. We present the dimension definitions and
analysis in Section 2.5.

The lower bounds in this paper are given by proving lower bounds on these dimension
quantities. In Section 2.5.1, we prove (as mentioned above) that for the class of dictator functions,
active testing cannot be done with fewer queries than the number of examples needed for learning,
even for the problem of distinguishing dictator functions from truly random functions. This
result additionally implies that any class that contains dictator functions (and is not so large as
to contain almost all functions) requires $\Omega(\log n)$ queries to test in the active model, including
decision trees, functions of low Fourier degree, juntas, DNFs, etc. In Section 2.5.2, we complete
the  proofs  of  the  lower  bounds  in  Theorem  2.5  on  the  number  of  queries  required  to  test  linear 
threshold  functions. 


2.2  Testing  Unions  of  Intervals 

In this section, we prove Theorem 2.4: we can test unions of $d$ intervals in the active testing
model using only $O(1/\epsilon^3)$ label requests, and furthermore, over the uniform distribution, using
only $O(\sqrt{d}/\epsilon^5)$ unlabeled samples. We begin with the case that the underlying distribution is
uniform over $[0,1]$, and afterwards show how to generalize to arbitrary distributions. Our tester
exploits the fact that unions of intervals have a noise sensitivity characterization.

Definition 2.6. Fix $\delta > 0$. The local $\delta$-noise sensitivity of the function $f : [0,1] \to \{0,1\}$ at
$x \in [0,1]$ is $NS_\delta(f, x) = \Pr_{y \sim_\delta x}[f(x) \neq f(y)]$, where $y \sim_\delta x$ represents a draw of $y$ uniform in
$(x - \delta, x + \delta) \cap [0,1]$. The noise sensitivity of $f$ is

$$NS_\delta(f) = \Pr_{x,\, y \sim_\delta x}[f(x) \neq f(y)]$$

or, equivalently, $NS_\delta(f) = E_x\, NS_\delta(f, x)$.

A  simple  argument  shows  that  unions  of  d  intervals  have  (relatively)  low  noise  sensitivity: 
Proposition 2.7. Fix $\delta > 0$ and let $f : [0,1] \to \{0,1\}$ be a union of $d$ intervals. Then
$NS_\delta(f) \le d\delta$.

Proof sketch. Draw $x \in [0,1]$ uniformly at random and $y \sim_\delta x$. The inequality $f(x) \neq f(y)$ can
only hold when a boundary $b \in [0,1]$ of one of the $d$ intervals in $f$ lies in between $x$ and $y$. For
any point $b \in [0,1]$, the probability that $x < b < y$ or $y < b < x$ is at most $\frac{\delta}{2}$, and there are at
most $2d$ boundaries of intervals in $f$, so the proposition follows from the union bound. □
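
As a quick numerical sanity check of Proposition 2.7, the following sketch estimates $NS_\delta(f)$ by Monte Carlo for one particular union of $d = 3$ intervals under the uniform distribution and compares it with $d\delta$. The interval endpoints, the value of $\delta$, the number of rounds, and all function names are illustrative assumptions, not part of the original text.

```python
import random

def union_of_intervals(intervals):
    """Return f: [0, 1] -> {0, 1} that is 1 exactly on the given open intervals."""
    def f(x):
        return int(any(lo < x < hi for lo, hi in intervals))
    return f

def estimate_noise_sensitivity(f, delta, rounds, rng):
    """Monte Carlo estimate of NS_delta(f) under the uniform distribution on [0, 1]."""
    disagreements = 0
    for _ in range(rounds):
        x = rng.random()
        # y is uniform in (x - delta, x + delta) intersected with [0, 1]
        y = rng.uniform(max(0.0, x - delta), min(1.0, x + delta))
        if f(x) != f(y):
            disagreements += 1
    return disagreements / rounds

rng = random.Random(0)
intervals = [(0.1, 0.2), (0.4, 0.5), (0.7, 0.8)]  # hypothetical example, d = 3
d, delta = len(intervals), 0.01
f = union_of_intervals(intervals)
ns_estimate = estimate_noise_sensitivity(f, delta, 100_000, rng)
print(f"estimated NS = {ns_estimate:.4f}, bound d*delta = {d * delta:.4f}")
```

Because the six boundaries here are well separated, the bound is essentially tight for this example, so the estimate should land close to $d\delta$ up to sampling noise.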

Interestingly, the converse of the proposition statement is approximately true: for $\delta$ small
enough, every function that has noise sensitivity not much larger than $d\delta$ is close to being a
union of $d$ intervals. (Full proof in Appendix 2.7.)

Lemma 2.8. Fix $\delta = \frac{\epsilon^2}{32d}$. Let $f : [0,1] \to \{0,1\}$ be a function with noise sensitivity bounded by
$NS_\delta(f) \le d\delta(1 + \frac{\epsilon}{4})$. Then $f$ is $\epsilon$-close to a union of $d$ intervals.

Proof outline. The proof proceeds in two steps. First, we construct a function $g : [0,1] \to \{0,1\}$
that is $\frac{\epsilon}{2}$-close to $f$ and is a union of at most $d(1 + \frac{\epsilon}{4})$ intervals. We then show that $g$, and every
other function that is a union of at most $d(1 + \frac{\epsilon}{4})$ intervals, is $\frac{\epsilon}{2}$-close to a union of $d$ intervals.

To construct the function $g$, we consider the “smoothed” function $f_\delta : [0,1] \to [0,1]$ obtained
by taking the convolution of $f$ and a uniform kernel of width $2\delta$. We define $\tau$ to be some
appropriately small parameter. When $f_\delta(x) \le \tau$, nearly all the points in the
$\delta$-neighborhood of $x$ have the value 0 in $f$, so we set $g(x) = 0$. Similarly, when $f_\delta(x) \ge 1 - \tau$,
we set $g(x) = 1$. (This procedure removes any “local noise” that might be present in $f$.)
This leaves all the points $x$ where $\tau < f_\delta(x) < 1 - \tau$. Let us call these points undefined. For
each such point $x$ we take the largest value $y \le x$ that is defined and set $g(x) = g(y)$.

The key technical part of the proof involves showing that the construction described above
yields a function $g$ that is $\frac{\epsilon}{2}$-close to $f$ and that is a union of $d(1 + \frac{\epsilon}{4})$ intervals. This is done with
standard tools from function analysis and probability theory. Due to space constraints, we defer
the details to Appendix 2.7. □
The noise sensitivity characterization of unions of intervals obtained by Proposition 2.7 and
Lemma 2.8 suggests a natural approach for building a tester: design an algorithm that estimates
the noise sensitivity of the input function and accepts iff this noise sensitivity is small enough.
This is indeed what we do:

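Union of Intervals Tester($f, d, \epsilon$)

Parameters: $\delta = \frac{\epsilon^2}{32d}$, $r = O(\epsilon^{-3})$.

1. For rounds $i = 1, \dots, r$,

1.1 Draw $x \in [0,1]$ uniformly at random.

1.2 Draw samples until we obtain $y \in (x - \delta, x + \delta)$.

1.3 Set $Z_i = \mathbf{1}[f(x) \neq f(y)]$.

2. Accept iff $\frac{1}{r}\sum_{i=1}^{r} Z_i \le d\delta(1 + \frac{\epsilon}{8})$.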

The algorithm makes $2r = O(\epsilon^{-3})$ queries to the function. Since a draw in Step 1.2 is in the
desired range with probability $2\delta$, the number of samples drawn by the algorithm is a random
variable with very tight concentration around $r(1 + \frac{1}{2\delta}) = O(d/\epsilon^5)$. The draw in Step 1.2 also
corresponds to choosing $y \sim_\delta x$. As a result, the probability that $f(x) \neq f(y)$ in a given round is
exactly $NS_\delta(f)$, and the average $\frac{1}{r}\sum_i Z_i$ is an unbiased estimate of the noise sensitivity of $f$. By
Proposition 2.7, Lemma 2.8, and Chernoff bounds, the algorithm therefore errs with probability
less than $\frac{1}{3}$ provided that $r > c \cdot 1/(d\delta\epsilon) = c \cdot 32/\epsilon^3$ for some suitably large constant $c$.
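
The tester can be sketched in a few lines of Python for the uniform distribution on $[0,1]$. This is a minimal illustration, not the authors' implementation: the constant `c = 8`, the acceptance threshold $d\delta(1 + \epsilon/8)$, and the two example functions are assumptions made for the demonstration. The function also counts unlabeled draws so the gap between label requests and samples is visible.

```python
import math
import random

def union_of_intervals_tester(f, d, eps, rng, c=8):
    """Sketch of the tester: returns (accept, label_queries, unlabeled_draws)."""
    delta = eps ** 2 / (32 * d)
    r = math.ceil(c * 32 / eps ** 3)      # number of rounds, r = O(eps^-3)
    disagreements = 0
    unlabeled = 0
    for _ in range(r):
        x = rng.random()                  # Step 1.1
        unlabeled += 1
        while True:                       # Step 1.2: draw until y lands near x
            y = rng.random()
            unlabeled += 1
            if abs(x - y) < delta:
                break
        if f(x) != f(y):                  # Step 1.3: two label requests
            disagreements += 1
    accept = disagreements / r <= d * delta * (1 + eps / 8)   # Step 2
    return accept, 2 * r, unlabeled

rng = random.Random(0)
d, eps = 3, 0.5
one_interval = lambda x: int(0.3 < x < 0.6)     # a union of at most d intervals
fine_comb = lambda x: int(x * 1000) % 2         # alternates 1000 times, far from the class

accept_close, queries, draws = union_of_intervals_tester(one_interval, d, eps, rng)
accept_far, _, _ = union_of_intervals_tester(fine_comb, d, eps, rng)
print(accept_close, accept_far, queries, draws)
```

Note that the number of label requests depends only on $\epsilon$, while the number of unlabeled draws grows with $d$ through the rejection step.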

Improved unlabeled sample complexity: Notice that by changing Steps 1.1-1.2 slightly to
pick the first pair $(x, y)$ such that $|x - y| < \delta$, we immediately improve the unlabeled sample
complexity to $O(\sqrt{d}/\epsilon^5)$ without affecting the analysis. In particular, this procedure is equivalent
to picking $x \in [0,1]$ then $y \sim_\delta x$.4 As a result, up to $\mathrm{poly}(1/\epsilon)$ terms, we also improve over
the passive testing bounds of Kearns and Ron [Kearns and Ron, 2000], which are able only to
distinguish the case that $f$ is a union of $d$ intervals from the case that $f$ is $\epsilon$-far from being a
union of $d/\epsilon$ intervals. (Their results use $O(\sqrt{d}/\epsilon^{1.5})$ examples.) Kearns and Ron [Kearns and
Ron, 2000] show that $\Omega(\sqrt{d})$ examples are necessary for passive testing, so in terms of $d$ this is
optimal.

4 Except for events of $O(\delta)$ probability mass at the boundary.

Active Tester Over Arbitrary Distributions: We can reduce the problem of testing over general
distributions to that of testing over the uniform distribution on $[0,1]$ by using the CDF of the
distribution $D$. In particular, given point $x$, define $p_x = \Pr_{y \sim D}[y \le x]$. So, for $x$ drawn from $D$,
$p_x$ is uniform in $[0,1]$.5 As a result we can just replace Step 1.2 in the tester with sampling until
we obtain $y$ such that $p_y \in (p_x - \delta, p_x + \delta)$. The only issue is that we do not know the $p_x$ and
$p_y$ values exactly. However, VC-dimension bounds for initial intervals on the line imply that if
we sample $O(\epsilon^{-6} d^2)$ unlabeled examples, with high probability the estimates $\hat{p}_x$ computed with
respect to the sample (the fraction of points in the sample that are $\le x$) will be within $O(\epsilon^3/d)$ of
the correct $p_x$ values for all points $x$. This in turn implies that the noise-sensitivity estimates are
sufficiently accurate that the procedure works as before.
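
The reduction via the empirical CDF can be illustrated as follows. This sketch assumes, purely for the example, that $D$ is an exponential distribution (so the true CDF is known and the error can be measured); the sample size and evaluation grid are also arbitrary choices.

```python
import bisect
import math
import random

def empirical_cdf(sample):
    """Return p_hat(x): the fraction of sample points that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    def p_hat(x):
        return bisect.bisect_right(ordered, x) / n
    return p_hat

rng = random.Random(1)
# Hypothetical unlabeled sample from D = Exponential(1).
sample = [rng.expovariate(1.0) for _ in range(20_000)]
p_hat = empirical_cdf(sample)

true_cdf = lambda x: 1.0 - math.exp(-x)
grid = [0.1 * k for k in range(1, 60)]
max_err = max(abs(p_hat(x) - true_cdf(x)) for x in grid)
print(f"largest CDF error on the grid: {max_err:.4f}")

# In the tester, a candidate y would be kept when p_hat(y) is within delta of p_hat(x).
def is_delta_neighbor(x, y, delta):
    return abs(p_hat(x) - p_hat(y)) < delta
```

The uniform error of the empirical CDF shrinks roughly like the inverse square root of the sample size, which is the behavior the sample bound in the text relies on.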

Putting  these  results  together,  we  have  Theorem  2.4. 

2.3  Testing  Linear  Threshold  Functions 

In the last section, we saw how unions of intervals are characterized by a statistic of the function,
namely its noise sensitivity, that can be estimated with few queries, and we used this to build
our tester. In this section, we follow the same high-level approach for testing linear threshold
functions. In this case, however, the statistic we will estimate is not noise sensitivity but rather
the sum of squares of the degree-1 Hermite coefficients of the function.

5 We are assuming here that $D$ is continuous and has a pdf. If $D$ has point masses, then instead define $p_x^- =
\Pr_y[y < x]$ and $p_x^+ = \Pr_y[y \le x]$ and select $p_x$ uniformly in $[p_x^-, p_x^+]$.

Definition 2.9. The Hermite polynomials are a set of polynomials $h_0(x) = 1$, $h_1(x) = x$, $h_2(x) =
\frac{1}{\sqrt{2}}(x^2 - 1), \dots$ that form a complete orthogonal basis for (square-integrable) functions $f : \mathbb{R} \to
\mathbb{R}$ over the inner product space defined by the inner product $\langle f, g \rangle = E_x[f(x)g(x)]$, where
the expectation is over the standard Gaussian distribution $\mathcal{N}(0,1)$. For any $S \in \mathbb{N}^n$, define
$H_S = \prod_{i=1}^{n} h_{S_i}(x_i)$. The Hermite coefficient of $f : \mathbb{R}^n \to \mathbb{R}$ corresponding to $S$ is $\hat{f}(S) =
\langle f, H_S \rangle = E_x[f(x)H_S(x)]$ and the Hermite decomposition of $f$ is $f(x) = \sum_{S \in \mathbb{N}^n} \hat{f}(S)H_S(x)$.

The degree of the coefficient $\hat{f}(S)$ is $|S| := \sum_{i=1}^{n} S_i$.

The connection between linear threshold functions and the Hermite decomposition of functions
is revealed by the following key lemma of Matulef et al. [Matulef, O’Donnell, Rubinfeld,
and Servedio, 2009].

Lemma 2.10 (Matulef et al. [Matulef, O’Donnell, Rubinfeld, and Servedio, 2009]). There is an
explicit continuous function $W : \mathbb{R} \to \mathbb{R}$ with bounded derivative $\|W'\|_\infty \le 1$ and peak value
$W(0) = \frac{2}{\pi}$ such that every linear threshold function $f : \mathbb{R}^n \to \{-1, 1\}$ satisfies
$\sum_{i=1}^{n} \hat{f}(e_i)^2 = W(E_x f)$. Moreover, every function $g : \mathbb{R}^n \to \{-1, 1\}$ that satisfies
$\left|\sum_{i=1}^{n} \hat{g}(e_i)^2 - W(E_x g)\right| \le 4\epsilon^3$ is $\epsilon$-close to being a linear threshold function.

In other words, Lemma 2.10 shows that $\sum_{i=1}^{n} \hat{f}(e_i)^2$ characterizes linear threshold functions.
To test LTFs, it suffices to estimate this value (and the expected value of the function) with
enough accuracy. Matulef et al. [Matulef, O’Donnell, Rubinfeld, and Servedio, 2009] showed
that $\sum_{i=1}^{n} \hat{f}(e_i)^2$ can be estimated with a number of queries that is independent of $n$ by querying $f$
on pairs $x, y \in \mathbb{R}^n$ where the marginal distributions on $x$ and $y$ are both the standard Gaussian
distribution and where $\langle x, y \rangle = \eta$ for some small (but constant) $\eta > 0$. Unfortunately, the
same approach does not work in the active testing model since with high probability, all pairs
of samples that we can query have inner product $|\langle x, y \rangle| \le O(\frac{1}{\sqrt{n}})$. Instead, we rely on the
following result.

Lemma 2.11. For any function $f : \mathbb{R}^n \to \mathbb{R}$, we have $\sum_{i=1}^{n} \hat{f}(e_i)^2 = E_{x,y}[f(x)f(y)\langle x, y \rangle]$,
where $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$ is the standard vector dot product.

Proof. Applying the Hermite decomposition of $f$ and linearity of expectation,

$$E_{x,y}[f(x)f(y)\langle x, y \rangle] = \sum_{i=1}^{n} \sum_{S,T \in \mathbb{N}^n} \hat{f}(S)\hat{f}(T)\, E_x[H_S(x)x_i]\, E_y[H_T(y)y_i].$$

By definition, $x_i = h_1(x_i) = H_{e_i}(x)$. The orthonormality of the Hermite polynomials therefore
guarantees that $E_x[H_S(x)H_{e_i}(x)] = \mathbf{1}[S = e_i]$. Similarly, $E_y[H_T(y)y_i] = \mathbf{1}[T = e_i]$. □
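
The identity in Lemma 2.11 can be checked numerically. The sketch below uses a hypothetical homogeneous LTF $f(x) = \mathrm{sgn}(\langle w, x \rangle)$ in $\mathbb{R}^3$ with a unit-norm $w$ chosen for illustration. For such an $f$ a direct calculation gives $\hat{f}(e_i) = \sqrt{2/\pi}\, w_i$, so both sides should be close to $2/\pi \approx 0.637$, consistent with the peak value $W(0)$ in Lemma 2.10.

```python
import math
import random

def ltf(w):
    """Homogeneous linear threshold function with values in {-1, +1}."""
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
    return f

def gaussian_vector(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def degree_one_weight(f, n, trials, rng):
    """Left-hand side: estimate each f_hat(e_i) = E[f(x) x_i], then sum the squares."""
    sums = [0.0] * n
    for _ in range(trials):
        x = gaussian_vector(n, rng)
        fx = f(x)
        for i in range(n):
            sums[i] += fx * x[i]
    return sum((s / trials) ** 2 for s in sums)

def pair_estimate(f, n, trials, rng):
    """Right-hand side: average of f(x) f(y) <x, y> over independent Gaussian pairs."""
    total = 0.0
    for _ in range(trials):
        x = gaussian_vector(n, rng)
        y = gaussian_vector(n, rng)
        total += f(x) * f(y) * sum(a * b for a, b in zip(x, y))
    return total / trials

rng = random.Random(2)
n = 3
f = ltf([0.6, 0.0, 0.8])        # unit-norm weight vector, chosen for illustration
lhs = degree_one_weight(f, n, 100_000, rng)
rhs = pair_estimate(f, n, 100_000, rng)
print(f"lhs = {lhs:.3f}, rhs = {rhs:.3f}, 2/pi = {2 / math.pi:.3f}")
```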

A natural idea for completing our LTF tester is to simply sample pairs $x, y \in \mathbb{R}^n$ independently
at random and evaluate $f(x)f(y)\langle x, y \rangle$ on each pair. While this approach does give
an unbiased estimate of $E_{x,y}[f(x)f(y)\langle x, y \rangle]$, it has poor query efficiency: to get enough
accuracy, we need to repeat this sampling strategy $\Omega(n)$ times. (That is, the query complexity of this
sampling approach is the same as that of learning LTFs.)

We can improve the query complexity of the sampling strategy by instead using U-statistics.
The U-statistic (of order 2) with symmetric kernel function $g : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is

$$U_g^m(x^1, \dots, x^m) := \binom{m}{2}^{-1} \sum_{1 \le i < j \le m} g(x^i, x^j).$$

Tight concentration bounds are known for U-statistics with well-behaved kernel functions. In
particular, by setting $g(x, y) = f(x)f(y)\langle x, y \rangle\, \mathbf{1}[|\langle x, y \rangle| \le \tau]$ to be an appropriately truncated
kernel for estimating $E[f(x)f(y)\langle x, y \rangle]$, we can apply a Bernstein-type inequality due to
Arcones [Arcones, 1995] to show that $O(\sqrt{n})$ samples are sufficient to estimate $\sum_{i=1}^{n} \hat{f}(e_i)^2$ with
sufficient accuracy. As a result, the following algorithm is a valid tester for LTFs.

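LTF Tester($f, \epsilon$)

Parameters: $\tau = \sqrt{4n\log(4n/\epsilon^3)}$, $m = 800\tau/\epsilon^3 + 32/\epsilon^6$.

1. Draw $x^1, x^2, \dots, x^m$ independently at random from $\mathbb{R}^n$.

2. Query $f(x^1), f(x^2), \dots, f(x^m)$.

3. Set $\tilde{\mu} = \frac{1}{m}\sum_{i=1}^{m} f(x^i)$.

4. Set $\tilde{\nu} = \binom{m}{2}^{-1} \sum_{i < j} f(x^i)f(x^j)\langle x^i, x^j \rangle \cdot \mathbf{1}[|\langle x^i, x^j \rangle| \le \tau]$.

5. Accept iff $|\tilde{\nu} - W(\tilde{\mu})| \le 2\epsilon^3$.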

The  algorithm  queries  the  function  only  on  inputs  that  are  all  independently  drawn  at  random 
from  the  n-dimensional  Gaussian  distribution.  As  a  result,  this  tester  works  in  both  the  active 
and passive testing models. For the complete proof of the correctness of the algorithm, see
Appendix  2.8. 
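
The following Python sketch mirrors the structure of the LTF tester. Two points are assumptions rather than content of the text above. First, the text only states that $W$ is explicit; the closed form used here, $W(\nu) = (2\phi(\theta))^2$ with $\theta = \Phi^{-1}((1-\nu)/2)$, is obtained by computing the degree-1 Hermite weight of a unit-norm halfspace $\mathrm{sgn}(\langle w, x \rangle - \theta)$ directly, and it matches $W(0) = 2/\pi$. Second, the sample size `m` is passed in by hand because the prescribed $m = 800\tau/\epsilon^3 + 32/\epsilon^6$ is too large for a quick demonstration, so this run carries no formal guarantee.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def W(nu):
    """Assumed closed form: W(nu) = (2*phi(theta))^2, theta = Phi^{-1}((1 - nu)/2)."""
    if abs(nu) >= 1.0:
        return 0.0
    theta = STD_NORMAL.inv_cdf((1.0 - nu) / 2.0)
    return (2.0 * STD_NORMAL.pdf(theta)) ** 2

def ltf_tester(f, n, eps, m, rng):
    """Sketch of the LTF tester with a caller-chosen sample size m."""
    tau = math.sqrt(4 * n * math.log(4 * n / eps ** 3))
    xs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]   # Step 1
    labels = [f(x) for x in xs]                                        # Step 2
    mu = sum(labels) / m                                               # Step 3
    total = 0.0
    for i in range(m):                                                 # Step 4
        for j in range(i + 1, m):
            ip = sum(a * b for a, b in zip(xs[i], xs[j]))
            if abs(ip) <= tau:                                         # truncated kernel
                total += labels[i] * labels[j] * ip
    nu = total / math.comb(m, 2)
    return abs(nu - W(mu)) <= 2 * eps ** 3                             # Step 5

rng = random.Random(4)
n, eps, m = 5, 0.5, 600
halfspace = lambda x: 1 if x[0] + 2 * x[1] - x[2] >= 0 else -1
xor_like = lambda x: 1 if x[0] * x[1] >= 0 else -1    # no degree-1 Hermite weight

accept_ltf = ltf_tester(halfspace, n, eps, m, rng)
accept_xor = ltf_tester(xor_like, n, eps, m, rng)
print(accept_ltf, accept_xor)
```

Every label is reused in $m - 1$ pairs, which is what lets the U-statistic extract on the order of $m^2$ kernel evaluations from only $m$ label requests.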

2.4  Testing  Disjoint  Unions  of  Testable  Properties 

We  now  show  that  active  testing  has  the  feature  that  a  disjoint  union  of  testable  properties  is 
testable,  with  a  number  of  queries  that  is  independent  of  the  size  of  the  union;  this  feature  does 
not  hold  for  passive  testing.  In  addition  to  providing  insight  into  the  distinction  between  the 
two  models,  this  fact  will  be  useful  in  our  analysis  of  semi-supervised  learning-based  properties 
mentioned  below  and  discussed  more  fully  in  Appendix  2.11. 

Specifically, given properties $\mathcal{P}_1, \dots, \mathcal{P}_N$ over domains $X_1, \dots, X_N$, define their disjoint
union $\mathcal{P}$ over domain $X = \{(i, x) : i \in [N], x \in X_i\}$ to be the set of functions $f$ such that
$f(i, x) = f_i(x)$ for some $f_i \in \mathcal{P}_i$. In addition, for any distribution $D$ over $X$, define $D_i$ to be the
conditional distribution over $X_i$ when the first component is $i$. If each $\mathcal{P}_i$ is testable over $D_i$ then
$\mathcal{P}$ is testable over $D$ with only small overhead in the number of queries:

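Theorem 2.12. Given properties $\mathcal{P}_1, \dots, \mathcal{P}_N$, if each $\mathcal{P}_i$ is testable over $D_i$ with $q(\epsilon)$ queries and
$U(\epsilon)$ unlabeled samples, then their disjoint union $\mathcal{P}$ is testable over the combined distribution $D$
with $O(q(\epsilon/2) \cdot (\log^3 \frac{1}{\epsilon}))$ queries and $O(U(\epsilon/2) \cdot (\frac{N}{\epsilon} \log^3 \frac{1}{\epsilon}))$ unlabeled samples.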

Proof.  See  Appendix  2.9.  □ 

As a simple example, consider $\mathcal{P}_i$ to contain just the constant functions 1 and 0. In this case,
$\mathcal{P}$ is equivalent to what is often called the “cluster assumption,” used in semi-supervised and
active learning [Chapelle, Schölkopf, and Zien, 2006, Dasgupta, 2011], that if data lies in some
number of clearly identifiable clusters, then all points in the same cluster should have the same
label. Here, each $\mathcal{P}_i$ individually is easily testable (even passively) with $O(1/\epsilon)$ labeled samples,
so Theorem 2.12 implies the cluster assumption is testable with $\mathrm{poly}(1/\epsilon)$ queries.6 However, it
is not hard to see that passive testing with $\mathrm{poly}(1/\epsilon)$ samples is not possible and in fact requires
$\Omega(\sqrt{N}/\epsilon)$ labeled examples.7

6 Since the $\mathcal{P}_i$ are so simple in this case, one can actually test with only $O(1/\epsilon)$ queries.
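
One simple way to test the cluster assumption with $O(1/\epsilon)$ label requests is sketched below. The pair-comparison procedure is an illustrative assumption, not necessarily the authors' tester: it repeatedly draws a point, draws unlabeled points until one falls in the same cluster, and rejects if the two labels differ. If the labeling is $\epsilon$-far from the cluster assumption, each pair disagrees with probability at least about $\epsilon$, so roughly $4/\epsilon$ pairs suffice. The finite point set, the constant 4, and all names are hypothetical.

```python
import math
import random

def cluster_tester(points, query, eps, rng):
    """points: list of (cluster_id, x); query(cluster_id, x) returns a label in {0, 1}."""
    pairs = math.ceil(4 / eps)
    for _ in range(pairs):
        i, x = rng.choice(points)
        while True:                      # unlabeled draws until we hit the same cluster
            j, y = rng.choice(points)
            if j == i:
                break
        if query(i, x) != query(j, y):   # two label requests per pair
            return False                 # witnessed a mixed cluster
    return True

rng = random.Random(5)
# Hypothetical data: N = 20 clusters with 50 points each, uniform distribution.
points = [(i, x) for i in range(20) for x in range(50)]
pure_labels = lambda i, x: i % 2     # every cluster has a single label
mixed_labels = lambda i, x: x % 2    # every cluster is split 50/50

accept_pure = cluster_tester(points, pure_labels, 0.1, rng)
accept_mixed = cluster_tester(points, mixed_labels, 0.1, rng)
print(accept_pure, accept_mixed)
```

The number of label requests does not depend on the number of clusters; only the unlabeled draws in the inner loop do.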

We build on this to produce testers for other properties often used in semi-supervised learning.
In particular, we prove the following result about testing the margin property (see Appendix 2.11
for definitions and analysis).

Theorem 2.13. For any $\gamma$, $\gamma' = \gamma(1 - 1/c)$ for constant $c > 1$, for data in the unit ball in $\mathbb{R}^d$ for
constant $d$, we can distinguish the case that $D_f$ has margin $\gamma$ from the case that $D_f$ is $\epsilon$-far from
margin $\gamma'$ using Active Testing with $O(1/(\gamma^{2d}\epsilon^2))$ unlabeled examples and $O(1/\epsilon)$ label requests.

2.5  General  Testing  Dimensions 

The  previous  sections  have  discussed  upper  and  lower  bounds  for  a  variety  of  classes.  Here, 
we  define  notions  of  testing  dimension  for  passive  and  active  testing  that  characterize  (up  to 
constant  factors)  the  number  of  labels  needed  for  testing  to  succeed,  in  the  corresponding  testing 
protocols.  These  will  be  distribution-specific  notions  (like  SQ  dimension  in  learning),  so  let  us 
fix some distribution $D$ over the instance space $X$, and furthermore fix some value $\epsilon$ defining our goal. I.e., our goal is to distinguish the case that $\mathrm{dist}_D(f, \mathcal{P}) = 0$ from the case $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$.

For a given set $S$ of unlabeled points, and a distribution $\pi$ over boolean functions, define $\pi_S$ to be the distribution over labelings of $S$ induced by $\pi$. That is, for $y \in \{0,1\}^{|S|}$ let $\pi_S(y) = \Pr_{f \sim \pi}[f(S) = y]$. We now use this to define a distance between distributions. Specifically, given a set of unlabeled points $S$ and two distributions $\pi$ and $\pi'$ over boolean functions, define

$$D_S(\pi, \pi') = \frac{1}{2} \sum_{y \in \{0,1\}^{|S|}} \left|\pi_S(y) - \pi'_S(y)\right|$$

to be the variation distance between $\pi$ and $\pi'$ induced by $S$. Finally, let $\Pi_0$ be the set of all distributions $\pi$ over functions in $\mathcal{P}$, and let $\Pi_\epsilon$ be the set of all distributions $\pi'$ in which a $1 - o(1)$ probability mass is over functions at least $\epsilon$-far from $\mathcal{P}$. We are now ready to formulate our notions of dimension.

$^7$ Specifically, suppose region 1 has $1 - 2\epsilon$ probability mass with $f_1 \in \mathcal{P}_1$, and suppose the other regions equally share the remaining $2\epsilon$ probability mass and either (a) are each pure but random (so $f \in \mathcal{P}$) or (b) are each 50/50 (so $f$ is $\epsilon$-far from $\mathcal{P}$). Distinguishing these cases requires seeing at least two points with the same index $i \ne 1$, yielding the $\Omega(\sqrt{N}/\epsilon)$ bound.
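As a small illustration of the distance $D_S$ just defined, the following sketch computes $\pi_S$ and $D_S(\pi, \pi')$ by brute-force enumeration. The choice of $n = 3$, the set `S`, and the helper names (`induced_labeling_dist`, `variation_distance`) are assumptions made for this example only; here $\pi$ is uniform over dictator functions and $\pi'$ is uniform over all boolean functions.

```python
from fractions import Fraction
from itertools import product


def induced_labeling_dist(functions, S):
    """pi_S for the uniform distribution pi over `functions`:
    maps each labeling y of S to Pr_{f ~ pi}[f(S) = y]."""
    weight = Fraction(1, len(functions))
    dist = {}
    for f in functions:
        y = tuple(f(x) for x in S)
        dist[y] = dist.get(y, 0) + weight
    return dist


def variation_distance(p, q):
    """D_S = (1/2) * sum_y |p(y) - q(y)| over all labelings seen in p or q."""
    labelings = set(p) | set(q)
    return sum(abs(p.get(y, 0) - q.get(y, 0)) for y in labelings) / 2


n = 3
points = list(product([0, 1], repeat=n))

# pi: uniform over the n dictator functions f(x) = x_i.
dictators = [lambda x, i=i: x[i] for i in range(n)]

# pi': uniform over all 2^(2^n) boolean functions, given as truth tables.
all_functions = [
    lambda x, table=table: table[points.index(x)]
    for table in product([0, 1], repeat=len(points))
]

S = [(0, 0, 1), (0, 1, 0)]  # an illustrative set of two unlabeled points
d_S = variation_distance(
    induced_labeling_dist(dictators, S),
    induced_labeling_dist(all_functions, S),
)
print(d_S)
```

Exact rational arithmetic is used so that the comparison with thresholds such as 1/4 is not affected by rounding; the enumeration is of course only feasible for very small $n$.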

Definition 2.14. Define the passive testing dimension, $d_{passive}$, as the largest $q \in \mathbb{N}$ such that

$$\sup_{\pi \in \Pi_0} \sup_{\pi' \in \Pi_\epsilon} \Pr_{S \sim D^q}\left(D_S(\pi, \pi') > 1/4\right) \le 1/4.$$

That is, there exist distributions $\pi$ and $\pi'$ such that a random set $S$ of $d_{passive}$ examples has a reasonable probability (at least 3/4) of having the property that one cannot reliably distinguish a random function from $\pi$ versus a random function from $\pi'$ from just the labels of $S$. From the definition it is fairly immediate that $\Omega(d_{passive})$ examples are necessary for passive testing; in fact, $O(d_{passive})$ are sufficient as well.

Theorem 2.15. The sample complexity of passive testing is $\Theta(d_{passive})$.

Proof.  See  Appendix  2.10.  □ 

For the case of active testing, there are two complications. First, the algorithms can examine their entire $\mathrm{poly}(n)$-sized unlabeled sample before deciding which points to query, and second, they may in principle determine the next query based on the responses to the previous ones (even though none of our algorithmic results require this feature). If we merely want to distinguish those properties that are actively testable with $O(1)$ queries from those that are not, then the second complication disappears and the first is simplified as well, and the following coarse notion of dimension suffices.

Definition 2.16. Define the coarse active testing dimension, $d_{coarse}$, as the largest $q \in \mathbb{N}$ such that

$$\sup_{\pi \in \Pi_0} \sup_{\pi' \in \Pi_\epsilon} \Pr_{S \sim D^q}\left(D_S(\pi, \pi') > 1/4\right) \le 1/n^q.$$

Theorem 2.17. If $d_{coarse} = O(1)$ then active testing of $\mathcal{P}$ can be done with $O(1)$ queries, and if $d_{coarse} = \omega(1)$ then it cannot.

Proof. See Appendix 2.10. □


To achieve a more fine-grained characterization of active testing we consider a slightly more involved quantity, as follows. First, recall that given an unlabeled sample $U$ and distribution $\pi$ over functions, we define $\pi_U$ as the induced distribution over labelings of $U$. We can view this as a distribution over unlabeled examples in $\{0,1\}^{|U|}$. Now, given two distributions over functions $\pi$, $\pi'$, define $\mathrm{Fair}(\pi, \pi', U)$ to be the distribution over labeled examples $(y, \ell)$ defined as: with probability 1/2 choose $y \sim \pi_U$, $\ell = 1$, and with probability 1/2 choose $y \sim \pi'_U$, $\ell = 0$. Thus, for a given unlabeled sample $U$, the sets $\Pi_0$ and $\Pi_\epsilon$ define a class of fair distributions over labeled examples. The active testing dimension, roughly, asks how well this class can be approximated by the class of low-depth decision trees. Specifically, let $\mathrm{DT}_k$ denote the class of decision trees of depth at most $k$. The active testing dimension for a given number $u$ of allowed unlabeled examples is as follows:

Definition 2.18. Given a number $u = \mathrm{poly}(n)$ of allowed unlabeled examples, we define the active testing dimension, $d_{active}(u)$, as the largest $q \in \mathbb{N}$ such that

$$\sup_{\pi \in \Pi_0} \sup_{\pi' \in \Pi_\epsilon} \Pr_{U \sim D^u}\left(\mathrm{err}^*(\mathrm{DT}_q, \mathrm{Fair}(\pi, \pi', U)) < 1/4\right) \le 1/4,$$

where $\mathrm{err}^*(H, P)$ is the error of the optimal function in $H$ with respect to data drawn from distribution $P$ over labeled examples.

Theorem 2.19. Active testing with failure probability $\frac{1}{8}$ using $u$ unlabeled examples requires $\Omega(d_{active}(u))$ label queries, and furthermore can be done with $O(u)$ unlabeled examples and $O(d_{active}(u))$ label queries.

Proof.  See  Appendix  2.10.  □ 

We  now  use  these  notions  of  dimension  to  prove  lower  bounds  for  testing  several  properties. 
2.5.1 Application: Dictator functions

We now prove Theorem 2.3 that active testing of dictatorships over the uniform distribution requires $\Omega(\log n)$ queries by proving an $\Omega(\log n)$ lower bound on $d_{active}(u)$ for any $u = \mathrm{poly}(n)$; in fact, this result holds even for the specific choice of $\pi'$ as random noise (the uniform distribution over all functions).


Proof of Theorem 2.3. Define $\pi$ and $\pi'$ to be uniform distributions over the dictator functions and over all boolean functions, respectively. In particular, $\pi$ is the distribution obtained by choosing $i \in [n]$ uniformly at random and returning the function $f : \{0,1\}^n \to \{0,1\}$ defined by $f(x) = x_i$. Fix $S$ to be a set of $q$ vectors in $\{0,1\}^n$. This set can be viewed as a $q \times n$ boolean-valued matrix. We write $c_1(S), \ldots, c_n(S)$ to represent the columns of this matrix. For any $y \in \{0,1\}^q$,

$$\pi_S(y) = \frac{|\{i \in [n] : c_i(S) = y\}|}{n} \quad \text{and} \quad \pi'_S(y) = 2^{-q}.$$


By Lemma 2.21, to prove that $d_{active} \ge \frac{1}{2}\log n$, it suffices to show that when $q < \frac{1}{2}\log n$ and $U$ is a set of $n^c$ vectors chosen uniformly and independently at random from $\{0,1\}^n$, then with probability at least $\frac{3}{4}$, every set $S \subseteq U$ of size $|S| = q$ and every $y \in \{0,1\}^q$ satisfy $\pi_S(y) \le \frac{6}{5}2^{-q}$. (This is like a stronger version of $d_{coarse}$ where $D_S(\pi, \pi')$ is replaced with an $L_\infty$ distance.)

Consider a set $S$ of $q$ vectors chosen uniformly and independently at random from $\{0,1\}^n$. For any vector $y \in \{0,1\}^q$, the expected number of columns of $S$ that are equal to $y$ is $n2^{-q}$. Since the columns are drawn independently at random, Chernoff bounds imply that

$$\Pr\left[\pi_S(y) > \tfrac{6}{5}2^{-q}\right] \le e^{-\left(\frac{1}{5}\right)^2 n 2^{-q}/3} = e^{-\frac{1}{75} n 2^{-q}}.$$

By the union bound, the probability that there exists a vector $y \in \{0,1\}^q$ such that more than $\frac{6}{5}n2^{-q}$ columns of $S$ are equal to $y$ is at most $2^q e^{-\frac{1}{75} n 2^{-q}}$. Furthermore, when $U$ is defined as above, we can apply the union bound once again over all subsets $S \subseteq U$ of size $|S| = q$ to obtain

$$\Pr\left[\exists S, y : \pi_S(y) > \tfrac{6}{5}2^{-q}\right] \le n^{cq} \cdot 2^q \cdot e^{-\frac{1}{75} n 2^{-q}}.$$

When $q \le \frac{1}{2}\log n$, this probability is bounded above by $e^{\frac{c}{2}\log^2 n + \frac{1}{2}\log n - \frac{\sqrt{n}}{75}}$, which is less than $\frac{1}{4}$ when $n$ is large enough, as we wanted to show. □
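The concentration step in the proof above can be checked empirically. The sketch below draws a random $q \times n$ boolean matrix, computes $\pi_S(y)$ as the fraction of columns equal to each pattern $y$, and compares the largest value against $2^{-q}$ with a 20% slack. The particular values of `n` and `q`, the slack factor `1.2`, and the helper name `max_pattern_fraction` are assumptions chosen for illustration; this is a sanity check on one sample, not a substitute for the union bound over all subsets of $U$.

```python
import random
from collections import Counter


def max_pattern_fraction(n, q, rng):
    """Draw S as q uniform vectors in {0,1}^n and return max_y pi_S(y),
    where pi_S(y) is the fraction of the n columns of S equal to y."""
    rows = [[rng.randint(0, 1) for _ in range(n)] for _ in range(q)]
    column_counts = Counter(tuple(row[i] for row in rows) for i in range(n))
    return max(column_counts.values()) / n


rng = random.Random(0)       # fixed seed so the run is reproducible
n, q = 8192, 3
worst = max_pattern_fraction(n, q, rng)
threshold = 1.2 * 2 ** (-q)  # 20% above the expected value 2^{-q}
print(worst, threshold)
```

With $n 2^{-q} = 1024$ expected columns per pattern, the standard deviation of each count is about 30, so a 20% excess would be a deviation of several standard deviations.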

2.5.2  Application:  LTFs 

The  testing  dimension  also  lets  us  prove  the  lower  bounds  in  Theorem  2.5  regarding  the  query 
complexity  for  testing  linear  threshold  functions.  Specifically,  those  bounds  follow  directly  from 
the  following  result. 

Theorem 2.20. For linear threshold functions under the standard $n$-dimensional Gaussian distribution, $d_{passive} = \Omega(\sqrt{n/\log(n)})$ and $d_{active} = \Omega((n/\log(n))^{1/3})$.

Let us give a brief overview of the strategies used to obtain the $d_{passive}$ and $d_{active}$ bounds. The complete proofs for both results, as well as a simpler proof that $d_{coarse} = \Omega((n/\log n)^{1/3})$, can be found in Appendix 2.10.4.

For both results, we set $\pi$ to be a distribution over LTFs obtained by choosing $w \sim \mathcal{N}(0, I_{n \times n})$ and outputting $f(x) = \mathrm{sgn}(w \cdot x)$. Set $\pi'$ to be the uniform distribution over all functions, i.e., for any $x \in \mathbb{R}^n$, the value of $f(x)$ is uniformly drawn from $\{0,1\}$ and is independent of the value of $f$ on other inputs.

To bound $d_{passive}$, we bound the total variation distance between the distribution of $Xw/\sqrt{n}$ given $X$, and the standard normal $\mathcal{N}(0, I_{n \times n})$. If this distance is small, then so must be the distance between the distribution of $\mathrm{sgn}(Xw)$ and the uniform distribution over label sequences.

Our strategy for bounding $d_{active}$ is very similar to the one we used to prove the lower bound on the query complexity for testing dictator functions in the last section. Again, we want to apply Lemma 2.21. Specifically, we want to show that when $q = o((n/\log(n))^{1/3})$ and $U$ is a set of $n^c$ vectors drawn independently from the $n$-dimensional standard Gaussian distribution, then with probability at least $\frac{3}{4}$, for every set $S \subseteq U$ of size $|S| = q$ and almost all $x \in \mathbb{R}^q$, we have $\pi_S(x) \le \frac{6}{5}2^{-q}$. The difference between this case and the lower bound for dictator functions is that we now rely on strong concentration bounds on the spectrum of random matrices [Vershynin, 2012] to obtain the desired inequality.


2.6  Proof  of  a  Property  Testing  Lemma 

The  following  lemma  is  a  generalization  of  a  lemma  that  is  widely  used  for  proving  lower  bounds 
in  property  testing  [Fischer,  2001,  Lem.  8.3].  We  use  this  lemma  to  prove  the  lower  bounds  on 
the  query  complexity  for  testing  dictator  functions  and  testing  linear  threshold  functions. 
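
Lemma 2.21. Let $\pi$ and $\pi'$ be two distributions on functions $X \to \mathbb{R}$. Fix $U \subseteq X$ to be a set of allowable queries. Suppose that for any $S \subseteq U$, $|S| = q$, there is a set $E_S \subseteq \mathbb{R}^q$ (possibly empty) satisfying $\pi_S(E_S) \le \frac{1}{5}2^{-q}$ such that

$$\pi_S(y) < \tfrac{6}{5}\pi'_S(y) \quad \text{for every } y \in \mathbb{R}^q \setminus E_S.$$

Then $\mathrm{err}^*(\mathrm{DT}_q, \mathrm{Fair}(\pi, \pi', U)) > 1/4$.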

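Proof. Consider any decision tree $A$ of depth $q$. Each internal node of the tree consists of a query $y \in U$ and a subset $T \subseteq \mathbb{R}$ such that its children are labeled by $T$ and $\mathbb{R} \setminus T$, respectively. The leaves of the tree are labeled with either "accept" or "reject", and let $L$ be the set of leaves labeled as accept. Each leaf $\ell \in L$ corresponds to a set $S_\ell \in U^q$ of queries and a subset $T_\ell \subseteq \mathbb{R}^q$, where $f : X \to \mathbb{R}$ leads to the leaf $\ell$ iff $f(S_\ell) \in T_\ell$. The probability that $A$ (correctly) accepts an input drawn from $\pi$ is

$$a_1 = \sum_{\ell \in L} \int_{T_\ell} \pi_{S_\ell}(y)\,dy.$$

Similarly, the probability that $A$ (incorrectly) accepts an input drawn from $\pi'$ is

$$a_2 = \sum_{\ell \in L} \int_{T_\ell} \pi'_{S_\ell}(y)\,dy.$$

The difference between the two rejection probabilities is bounded above by

$$a_1 - a_2 \le \sum_{\ell \in L} \int_{T_\ell \setminus E_{S_\ell}} \left(\pi_{S_\ell}(y) - \pi'_{S_\ell}(y)\right)dy + \sum_{\ell \in L} \int_{T_\ell \cap E_{S_\ell}} \pi_{S_\ell}(y)\,dy.$$

The conditions in the statement of the lemma then imply that

$$a_1 - a_2 \le \sum_{\ell \in L} \int_{T_\ell} \tfrac{1}{6}\pi_{S_\ell}(y)\,dy + \tfrac{5}{6}\sum_{\ell \in L} \int_{E_{S_\ell}} \pi_{S_\ell}(y)\,dy < \frac{1}{3},$$

since $\pi_{S_\ell}(y) - \pi'_{S_\ell}(y) < \frac{1}{6}\pi_{S_\ell}(y)$ outside $E_{S_\ell}$, the first sum is at most $\frac{1}{6}a_1 \le \frac{1}{6}$, and a tree of depth $q$ has at most $2^q$ leaves, each contributing at most $\frac{1}{5}2^{-q}$ to the second sum. To complete the proof, we note that $A$ errs on an input drawn from $\mathrm{Fair}(\pi, \pi', U)$ with probability

$$\tfrac{1}{2}(1 - a_1) + \tfrac{1}{2}a_2 = \tfrac{1}{2} - \tfrac{1}{2}(a_1 - a_2) > \tfrac{1}{3}.$$

□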


2.7  Proofs  for  Testing  Unions  of  Intervals 

In  this  section  we  complete  the  proofs  of  the  technical  results  in  Section  2.2. 

Proposition 2.7 (Restated). Fix $\delta > 0$ and let $f : [0,1] \to \{0,1\}$ be a union of $d$ intervals. Then $\mathrm{NS}_\delta(f) \le d\delta$.

Proof. For any fixed $b \in [0,1]$, the probability that $x < b < y$ when $x \sim U(0,1)$ and $y \sim U(x - \delta, x + \delta)$ is

$$\Pr_{x,y}[x < b < y] = \int_0^\delta \Pr_{y \sim U(b-t-\delta,\, b-t+\delta)}[y \ge b]\,dt = \int_0^\delta \frac{\delta - t}{2\delta}\,dt = \frac{\delta}{4}.$$

Similarly, $\Pr_{x,y}[y < b < x] = \frac{\delta}{4}$. So the probability that $b$ lies between $x$ and $y$ is at most $\frac{\delta}{2}$. When $f$ is the union of $d$ intervals, $f(x) \ne f(y)$ only if at least one of the boundaries $b_1, \ldots, b_{2d}$ of the intervals of $f$ lies in between $x$ and $y$. So by the union bound, $\Pr[f(x) \ne f(y)] \le 2d(\delta/2) = d\delta$. Note that if $b$ is within distance $\delta$ of 0 or 1, the probability is only lower. □
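The bound $\mathrm{NS}_\delta(f) \le d\delta$ can be sanity-checked by simulation. The following sketch estimates the noise sensitivity of a union of two intervals by sampling pairs $(x, y)$ exactly as in the proof. The specific intervals, the value of `delta`, the number of trials, and the helper names are assumptions for this example; points falling outside $[0,1]$ are treated as labeled 0.

```python
import random


def union_of_intervals(intervals):
    """Indicator function of a union of closed intervals in [0, 1]."""
    def f(x):
        return int(any(a <= x <= b for a, b in intervals))
    return f


def estimate_noise_sensitivity(f, delta, trials, rng):
    """Monte Carlo estimate of NS_delta(f) = Pr[f(x) != f(y)] with
    x ~ U(0, 1) and y ~ U(x - delta, x + delta)."""
    disagreements = 0
    for _ in range(trials):
        x = rng.random()
        y = rng.uniform(x - delta, x + delta)
        disagreements += f(x) != f(y)
    return disagreements / trials


rng = random.Random(0)  # fixed seed for reproducibility
f = union_of_intervals([(0.1, 0.3), (0.6, 0.8)])
d, delta = 2, 0.01
ns = estimate_noise_sensitivity(f, delta, 200_000, rng)
print(ns, d * delta)
```

Because all four boundaries are well separated and far from the endpoints of $[0,1]$, each contributes $\delta/2$ and the estimate should sit close to the bound $d\delta$ itself, up to sampling error.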


Lemma 2.8 (Restated). Fix $\delta = \frac{\epsilon^2}{32d}$. Let $f : [0,1] \to \{0,1\}$ be any function with noise sensitivity $\mathrm{NS}_\delta(f) \le d\delta(1 + \frac{\epsilon}{4})$. Then $f$ is $\epsilon$-close to a union of $d$ intervals.

Proof. The proof proceeds in two steps: We first show that $f$ is $\frac{\epsilon}{2}$-close to a union of $d(1 + \frac{\epsilon}{2})$ intervals, then we show that every union of $d(1 + \frac{\epsilon}{2})$ intervals is $\frac{\epsilon}{2}$-close to a union of $d$ intervals.

Consider the "smoothed" function $f_\delta : [0,1] \to [0,1]$ defined by

$$f_\delta(x) = \mathbb{E}_{y \sim_\delta x} f(y) = \frac{1}{2\delta}\int_{x-\delta}^{x+\delta} f(y)\,dy.$$

The function $f_\delta$ is the convolution of $f$ and the uniform kernel $\phi : \mathbb{R} \to [0,1]$ defined by

$$\phi(x) = \frac{1}{2\delta}\mathbf{1}[|x| \le \delta].$$

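Fix $\tau = \frac{4}{\epsilon}\mathrm{NS}_\delta(f)$. We introduce the function $g^* : [0,1] \to \{0, 1, *\}$ by setting

$$g^*(x) = \begin{cases} 1 & \text{when } f_\delta(x) \ge 1 - \tau, \\ 0 & \text{when } f_\delta(x) \le \tau, \text{ and} \\ * & \text{otherwise} \end{cases}$$

for all $x \in [0,1]$. Finally, we define $g : [0,1] \to \{0,1\}$ by setting $g(x) = g^*(y)$ where $y \le x$ is the largest value for which $g^*(y) \ne *$. (If no such $y$ exists, we fix $g(x) = 0$.)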


We first claim that $\mathrm{dist}(f, g) \le \frac{\epsilon}{2}$. To see this, note that

$$\begin{aligned}
\mathrm{dist}(f, g) &= \Pr_x[f(x) \ne g(x)] \\
&\le \Pr_x[g^*(x) = *] + \Pr_x[f(x) = 0 \wedge g^*(x) = 1] + \Pr_x[f(x) = 1 \wedge g^*(x) = 0] \\
&= \Pr_x[\tau < f_\delta(x) < 1 - \tau] + \Pr_x[f(x) = 0 \wedge f_\delta(x) \ge 1 - \tau] + \Pr_x[f(x) = 1 \wedge f_\delta(x) \le \tau].
\end{aligned}$$

We bound the three terms on the RHS individually. For the first term, we observe that $\mathrm{NS}_\delta(f, x) = \min\{f_\delta(x), 1 - f_\delta(x)\}$ and that $\mathbb{E}_x \mathrm{NS}_\delta(f, x) = \mathrm{NS}_\delta(f)$. From these identities and Markov's inequality, we have that

$$\Pr_x[\tau < f_\delta(x) < 1 - \tau] = \Pr_x[\mathrm{NS}_\delta(f, x) > \tau] \le \frac{\mathrm{NS}_\delta(f)}{\tau} = \frac{\epsilon}{4}.$$

For the second term, let $S \subseteq [0,1]$ denote the set of points $x$ where $f(x) = 0$ and $f_\delta(x) \ge 1 - \tau$. Let $\Gamma \subseteq S$ represent a $\delta$-net of $S$. Clearly, $|\Gamma| \le \frac{1}{\delta}$. For $x \in \Gamma$, let $B_x = (x - \delta, x + \delta)$ be a ball of radius $\delta$ around $x$. Since $f_\delta(x) \ge 1 - \tau$, the intersection of $S$ and $B_x$ has mass at most $|S \cap B_x| \le \tau\delta$. Therefore, the total mass of $S$ is at most $|S| \le |\Gamma|\tau\delta = \tau$. By the bounds on the noise sensitivity of $f$ in the lemma's statement, we therefore have

$$\Pr_x[f(x) = 0 \wedge f_\delta(x) \ge 1 - \tau] \le \tau \le \frac{\epsilon}{8}.$$

Similarly, we obtain the same bound on the third term. As a result, $\mathrm{dist}(f, g) \le \frac{\epsilon}{4} + \frac{\epsilon}{8} + \frac{\epsilon}{8} = \frac{\epsilon}{2}$, as we wanted to show.


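We now want to show that $g$ is a union of $m \le d(1 + \frac{\epsilon}{2})$ intervals. Each left boundary of an interval in $g$ occurs at a point $x \in [0,1]$ where $g^*(x) = *$, where the maximum $y < x$ such that $g^*(y) \ne *$ takes the value $g^*(y) = 0$, and where the minimum $z > x$ such that $g^*(z) \ne *$ has the value $g^*(z) = 1$. In other words, for each left boundary of an interval in $g$, there exists an interval $(y, z)$ such that $f_\delta(y) \le \tau$, $f_\delta(z) \ge 1 - \tau$, and for each $y < x < z$, $f_\delta(x) \in (\tau, 1 - \tau)$. Fix any interval $(y, z)$. Since $f_\delta$ is the convolution of $f$ with a uniform kernel of width $2\delta$, it is Lipschitz continuous (with Lipschitz constant $\frac{1}{2\delta}$). So there exists $x \in (y, z)$ such that the conditions $f_\delta(x) = \frac{1}{2}$, $x - y \ge 2\delta(\frac{1}{2} - \tau)$, and $z - x \ge 2\delta(\frac{1}{2} - \tau)$ all hold. As a result,

$$\int_y^z \mathrm{NS}_\delta(f, t)\,dt = \int_y^x \mathrm{NS}_\delta(f, t)\,dt + \int_x^z \mathrm{NS}_\delta(f, t)\,dt \ge 2\delta\left(\tfrac{1}{2} - \tau\right)^2.$$

Similarly, for each right boundary of an interval in $g$, we have an interval $(y, z)$ such that

$$\int_y^z \mathrm{NS}_\delta(f, t)\,dt \ge 2\delta\left(\tfrac{1}{2} - \tau\right)^2.$$

The intervals $(y, z)$ for the left and right boundaries are all disjoint, so

$$\mathrm{NS}_\delta(f) \ge \sum_{i=1}^{2m} \int_{y_i}^{z_i} \mathrm{NS}_\delta(f, t)\,dt \ge 2m \cdot \tfrac{1}{2}(1 - 2\tau)^2\delta.$$

This means that

$$m \le \frac{\mathrm{NS}_\delta(f)}{(1 - 2\tau)^2\delta} \le \frac{d\delta(1 + \epsilon/4)}{(1 - 2\tau)^2\delta},$$

and $g$ is a union of at most $d(1 + \frac{\epsilon}{2})$ intervals, as we wanted to show.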


Finally, we want to show that any function that is the union of $m \le d(1 + \frac{\epsilon}{2})$ intervals is $\frac{\epsilon}{2}$-close to a union of $d$ intervals. Let $\ell_1, \ldots, \ell_m$ represent the lengths of the intervals in $g$. Clearly, $\ell_1 + \cdots + \ell_m \le 1$, so there must be a set $S$ of $m - d \le d\epsilon/2$ intervals in $g$ with total length

$$\sum_{i \in S} \ell_i \le \frac{m - d}{m} \le \frac{d\epsilon/2}{d(1 + \frac{\epsilon}{2})} < \frac{\epsilon}{2}.$$

Consider the function $h : [0,1] \to \{0,1\}$ obtained by removing the intervals in $S$ from $g$ (i.e., by setting $h(x) = 0$ for the values $x \in [b_{2i-1}, b_{2i}]$ for some $i \in S$). The function $h$ is a union of $d$ intervals and $\mathrm{dist}(g, h) \le \frac{\epsilon}{2}$. This completes the proof, since $\mathrm{dist}(f, h) \le \mathrm{dist}(f, g) + \mathrm{dist}(g, h) \le \epsilon$. □


2.8  Proofs  for  Testing  LTFs 


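We complete the proof that LTFs can be tested with $\tilde{O}(\sqrt{n})$ samples in this section.

For a fixed function $f : \mathbb{R}^n \to \mathbb{R}$, define $g : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ to be $g(x, y) = f(x)f(y)\langle x, y\rangle$. Let $g^* : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be the truncation of $g$ defined by setting

$$g^*(x, y) = \begin{cases} f(x)f(y)\langle x, y\rangle & \text{if } |\langle x, y\rangle| \le \sqrt{4n\log(4n/\epsilon^3)} \\ 0 & \text{otherwise.} \end{cases}$$

Our goal is to estimate $\mathbb{E}g$. The following lemma shows that $\mathbb{E}g^*$ provides a good estimate of this value.

Lemma 2.22. Let $g, g^* : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be defined as above. Then $|\mathbb{E}g - \mathbb{E}g^*| \le \frac{1}{2}\epsilon^3$.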


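Proof. For notational clarity, fix $\tau = \sqrt{4n\log(4n/\epsilon^3)}$. By the definition of $g$ and $g^*$ and with the trivial bound $|f(x)f(y)\langle x, y\rangle| \le n$ we have

$$|\mathbb{E}g - \mathbb{E}g^*| = \Pr_{x,y}\left[|\langle x, y\rangle| > \tau\right] \cdot \left|\mathbb{E}_{x,y}\left[f(x)f(y)\langle x, y\rangle \,\middle|\, |\langle x, y\rangle| > \tau\right]\right| \le n \cdot \Pr_{x,y}\left[|\langle x, y\rangle| > \tau\right].$$

The right-most term can be bounded with a standard Chernoff argument. By Markov's inequality and the independence of the variables $x_1, \ldots, x_n, y_1, \ldots, y_n$,

$$\Pr_{x,y}\left[\langle x, y\rangle > \tau\right] = \Pr\left[e^{t\langle x, y\rangle} > e^{t\tau}\right] \le \frac{\mathbb{E}e^{t\langle x, y\rangle}}{e^{t\tau}} = \frac{\prod_{i=1}^n \mathbb{E}e^{tx_iy_i}}{e^{t\tau}}.$$

The moment generating function of a standard normal random variable is $\mathbb{E}e^{ty} = e^{t^2/2}$, so

$$\mathbb{E}_{x_i, y_i}\left[e^{tx_iy_i}\right] = \mathbb{E}_{x_i}\left[\mathbb{E}_{y_i}e^{tx_iy_i}\right] = \mathbb{E}_{x_i}e^{(t^2/2)x_i^2}.$$

When $x \sim \mathcal{N}(0, 1)$, the random variable $x^2$ has a $\chi^2$ distribution with 1 degree of freedom. The moment generating function of this variable is $\mathbb{E}e^{tx^2} = \frac{1}{\sqrt{1 - 2t}} = \sqrt{1 + \frac{2t}{1 - 2t}}$ for any $t < \frac{1}{2}$. Hence,

$$\mathbb{E}_{x_i}e^{(t^2/2)x_i^2} = \sqrt{1 + \frac{t^2}{1 - t^2}} \le e^{\frac{t^2}{2(1 - t^2)}}$$

for any $t < 1$. Combining the above results and setting $t = \frac{\tau}{2n}$ yields

$$\Pr_{x,y}\left[\langle x, y\rangle > \tau\right] \le e^{\frac{nt^2}{2(1 - t^2)} - t\tau} \le e^{-\frac{\tau^2}{4n}} = \frac{\epsilon^3}{4n}.$$

The same argument shows that $\Pr[\langle x, y\rangle < -\tau] \le \frac{\epsilon^3}{4n}$ as well. □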
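The arithmetic in this Chernoff argument can be verified numerically. The sketch below evaluates the exponent $\frac{nt^2}{2(1-t^2)} - t\tau$ for $\tau = \sqrt{4n\log(4n/\epsilon^3)}$ and compares the resulting bound with the target $\epsilon^3/(4n)$. The choice $t = \tau/(2n)$ is an assumption of this sketch (it is one value for which the inequality goes through whenever $t^2 \le 1/2$), and the function name and the sample values of `n` and `eps` are illustrative.

```python
import math


def chernoff_tail_bound(n, eps):
    """Return (bound, target), where bound = exp(n t^2 / (2 (1 - t^2)) - t tau)
    and target = eps^3 / (4 n), using the assumed choice t = tau / (2 n)."""
    tau = math.sqrt(4 * n * math.log(4 * n / eps**3))
    t = tau / (2 * n)
    if t >= 1:
        raise ValueError("the moment generating function bound needs t < 1")
    exponent = n * t**2 / (2 * (1 - t**2)) - t * tau
    return math.exp(exponent), eps**3 / (4 * n)


for n, eps in [(50, 0.5), (100, 0.1), (1000, 0.01)]:
    bound, target = chernoff_tail_bound(n, eps)
    print(n, eps, bound <= target)
```

Multiplying the two-sided tail $2 \cdot \epsilon^3/(4n)$ by the factor $n$ from the first display recovers the $\frac{1}{2}\epsilon^3$ of Lemma 2.22.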

The reason we consider the truncation $g^*$ is that its smaller $L_\infty$ norm will enable us to apply a strong Bernstein-type inequality on the concentration of measure of the U-statistic estimate of $\mathbb{E}g^*$.

Lemma 2.23 (Arcones [Arcones, 1995]). For a symmetric function $h : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, let $\Sigma^2 = \mathbb{E}_x[\mathbb{E}_y[h(x, y)]^2] - \mathbb{E}_{x,y}[h(x, y)]^2$, let $b = \|h - \mathbb{E}h\|_\infty$, and let $U_m(h)$ be a random variable obtained by drawing $x^1, \ldots, x^m$ independently at random and setting $U_m(h) = \binom{m}{2}^{-1}\sum_{i<j} h(x^i, x^j)$. Then for every $t > 0$,

$$\Pr\left[|U_m(h) - \mathbb{E}h| > t\right] \le 4\exp\left(-\frac{mt^2}{8\Sigma^2 + 100bt}\right).$$

We  are  now  ready  to  complete  the  proof  of  the  upper  bound  of  Theorem  2.5. 

Theorem 2.24 (Upper bound in Theorem 2.5, restated). Linear threshold functions can be tested over the standard $n$-dimensional Gaussian distribution with $O(\sqrt{n\log n})$ queries in both the active and passive testing models.

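Proof. Consider the LTF-Tester algorithm. When the estimates $\tilde{\mu}$ and $\tilde{\nu}$ satisfy

$$|\tilde{\mu} - \mathbb{E}f| \le \epsilon^3 \quad \text{and} \quad |\tilde{\nu} - \mathbb{E}[f(x)f(y)\langle x, y\rangle]| \le \epsilon^3,$$

Lemmas 2.10 and 2.11 guarantee that the algorithm correctly distinguishes LTFs from functions that are far from LTFs. To complete the proof, we must therefore show that the estimates are within the specified error bounds with probability at least 2/3.

The values $f(x^1), \ldots, f(x^m)$ are independent $\{-1, 1\}$-valued random variables. By Hoeffding's inequality,

$$\Pr\left[|\tilde{\mu} - \mathbb{E}f| \le \epsilon^3\right] \ge 1 - 2e^{-\epsilon^6 m/2} = 1 - 2e^{-O(\sqrt{n})}.$$

The estimate $\tilde{\nu}$ is a U-statistic with kernel $g^*$ as defined above. This kernel satisfies

$$\|g^* - \mathbb{E}g^*\|_\infty \le 2\|g^*\|_\infty = 2\sqrt{4n\log(4n/\epsilon^3)}$$

and

$$\Sigma^2 \le \mathbb{E}_y\left[\mathbb{E}_x[g^*(x, y)]^2\right] = \mathbb{E}_y\left[\mathbb{E}_x\left[f(x)f(y)\langle x, y\rangle \mathbf{1}[|\langle x, y\rangle| \le \tau]\right]^2\right].$$

For any two functions $\phi, \psi : \mathbb{R}^n \to \mathbb{R}$, when $\psi$ is $\{0,1\}$-valued the Cauchy-Schwarz inequality implies that $\mathbb{E}_x[\phi(x)\psi(x)]^2 \le \mathbb{E}_x[\phi(x)]\,\mathbb{E}_x[\phi(x)\psi(x)^2] = \mathbb{E}_x[\phi(x)]\,\mathbb{E}_x[\phi(x)\psi(x)]$ and so $\mathbb{E}_x[\phi(x)\psi(x)]^2 \le \mathbb{E}_x[\phi(x)]^2$. Applying this inequality to the expression for $\Sigma^2$ gives

$$\Sigma^2 \le \mathbb{E}_y\left[\mathbb{E}_x[f(x)f(y)\langle x, y\rangle]^2\right] = \mathbb{E}_y\left[\Big(\sum_{i=1}^n f(y)y_i\,\mathbb{E}_x[f(x)x_i]\Big)^2\right] = \sum_{i,j} \hat{f}(e_i)\hat{f}(e_j)\,\mathbb{E}_y[y_iy_j] = \sum_{i=1}^n \hat{f}(e_i)^2.$$

By Parseval's identity, we have $\sum_i \hat{f}(e_i)^2 \le \|\hat{f}\|_2^2 = \|f\|_2^2 = 1$. Lemmas 2.22 and 2.23 imply that, with $t = \frac{1}{2}\epsilon^3$,

$$\Pr\left[|\tilde{\nu} - \mathbb{E}g| \le \epsilon^3\right] \ge \Pr\left[|\tilde{\nu} - \mathbb{E}g^*| \le \tfrac{1}{2}\epsilon^3\right] \ge 1 - 4\exp\left(-\frac{mt^2}{8 + 200\sqrt{4n\log(4n/\epsilon^3)}\,t}\right) > \frac{11}{12}.$$

The union bound completes the proof of correctness. □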
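To make the U-statistic estimate concrete, here is a minimal sketch that computes $U_m(g^*)$ for the truncated kernel $g^*(x, y) = f(x)f(y)\langle x, y\rangle \mathbf{1}[|\langle x, y\rangle| \le \tau]$ on Gaussian samples. The LTF-Tester algorithm itself is not reproduced here; the test function `ltf` (the sign of the first coordinate), the values of `n`, `eps`, and `m`, and the helper names are assumptions made for illustration. For this particular $f$, the untruncated expectation is $\sum_i \hat{f}(e_i)^2 = (\mathbb{E}|x_1|)^2 = 2/\pi$.

```python
import math
import random
from itertools import combinations


def truncated_kernel(f, tau):
    """g*(x, y) = f(x) f(y) <x, y> if |<x, y>| <= tau, and 0 otherwise."""
    def g_star(x, y):
        inner = sum(a * b for a, b in zip(x, y))
        return f(x) * f(y) * inner if abs(inner) <= tau else 0.0
    return g_star


def u_statistic(h, sample):
    """U_m(h): the average of h over all unordered pairs of sample points."""
    total = sum(h(x, y) for x, y in combinations(sample, 2))
    return total / math.comb(len(sample), 2)


rng = random.Random(0)  # fixed seed for reproducibility
n, eps, m = 5, 0.5, 400
tau = math.sqrt(4 * n * math.log(4 * n / eps**3))


def ltf(x):
    # A simple linear threshold function: sign of the first coordinate.
    return 1 if x[0] >= 0 else -1


sample = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
nu = u_statistic(truncated_kernel(ltf, tau), sample)
print(nu, 2 / math.pi)
```

The estimate uses every pair among the $m$ points, which is why a sample of size roughly $\sqrt{n}$ (up to logarithmic factors) can suffice in the theorem even though each individual pair is a noisy observation.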

2.9  Proofs  for  Testing  Disjoint  Unions 

Theorem 2.12 (Restated). Given properties $\mathcal{P}_1, \ldots, \mathcal{P}_N$, if each $\mathcal{P}_i$ is testable over $D_i$ with $q(\epsilon)$ queries and $U(\epsilon)$ unlabeled samples, then their disjoint union $\mathcal{P}$ is testable over the combined distribution $D$ with $O(q(\epsilon/2) \cdot (\log^3 \frac{1}{\epsilon}))$ queries and $O(U(\epsilon/2) \cdot (\frac{N}{\epsilon}\log^3 \frac{1}{\epsilon}))$ unlabeled samples.

Proof. Let $p = (p_1, \ldots, p_N)$ denote the mixing weights for distribution $D$; that is, a random draw from $D$ can be viewed as selecting $i$ from distribution $p$ and then selecting $x$ from $D_i$. We are given that each $\mathcal{P}_i$ is testable with failure probability 1/3 using $q(\epsilon)$ queries and $U(\epsilon)$ unlabeled samples. By repetition, this implies that each is testable with failure probability $\delta$ using $q_\delta(\epsilon) = O(q(\epsilon)\log(1/\delta))$ queries and $U_\delta(\epsilon) = O(U(\epsilon)\log(1/\delta))$ unlabeled samples, where we will set $\delta = \epsilon^2$. We now test property $\mathcal{P}$ as follows:

For $\epsilon' = 1/2, 1/4, 1/8, \ldots, \epsilon/2$ do:

    Repeat $O(\frac{\epsilon'}{\epsilon}\log(1/\epsilon))$ times:

        1. Choose a random $(i, x)$ from $D$.

        2. Sample until either $U_\delta(\epsilon')$ samples have been drawn from $D_i$, or $(8N/\epsilon)U_\delta(\epsilon')$ samples total have been drawn from $D$, whichever comes first.

        3. In the former case, run the tester for property $\mathcal{P}_i$ with parameter $\epsilon'$, making $q_\delta(\epsilon')$ queries. If the tester rejects, then reject.

If all runs have accepted, then accept.
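The loop structure of this tester can be sketched in code as follows. This is only an illustration under stated assumptions: the interfaces `draw`, `component_tester`, and `u_delta` are hypothetical stand-ins for sampling from $D$, running the tester for $\mathcal{P}_i$, and the sample requirement $U_\delta(\epsilon')$; `rep_const` stands in for the constant hidden in the $O(\cdot)$ repetition count; and the demo uses the cluster assumption (each $\mathcal{P}_i$ is the set of constant functions) with a component tester that simply queries every pooled point.

```python
import math
import random


def disjoint_union_tester(eps, draw, component_tester, u_delta, n_parts,
                          rep_const=4.0):
    """Return True (accept) or False (reject), following the loop above.

    draw()                     -> (i, x), one unlabeled draw from D
    component_tester(i, xs, e) -> bool, tester for P_i on points xs from D_i
    u_delta(e)                 -> int, unlabeled samples that tester needs
    """
    eps_prime = 0.5
    while eps_prime >= eps / 2:
        reps = math.ceil(rep_const * (eps_prime / eps) * math.log(1 / eps))
        for _rep in range(reps):
            i, _x = draw()                                 # Step 1
            need = u_delta(eps_prime)
            budget = math.ceil(8 * n_parts / eps) * need   # Step 2 cap
            pool = []
            for _draw in range(budget):
                j, x = draw()
                if j == i:
                    pool.append(x)
                    if len(pool) == need:
                        break
            # Step 3: run the component tester only if the pool was filled.
            if len(pool) == need and not component_tester(i, pool, eps_prime):
                return False
        eps_prime /= 2
    return True


rng = random.Random(0)  # fixed seed for reproducibility
N = 4


def draw():
    # D: uniform mixture of N regions; x is uniform on [0, 1) within a region.
    return rng.randrange(N), rng.random()


def make_constant_tester(label):
    # Tester for "f is constant on region i": accept iff one label is seen.
    return lambda i, xs, e: len({label(i, x) for x in xs}) == 1


def u_delta(e):
    return math.ceil(4 / e)


clustered = lambda i, x: i % 2     # constant on each region, so f is in P
mixed = lambda i, x: int(x < 0.5)  # 50/50 inside every region

accepts_clustered = disjoint_union_tester(
    0.25, draw, make_constant_tester(clustered), u_delta, N)
accepts_mixed = disjoint_union_tester(
    0.25, draw, make_constant_tester(mixed), u_delta, N)
print(accepts_clustered, accepts_mixed)
```

The geometric schedule over $\epsilon'$ is the point of the construction: regions that are very far from their property are found with few repetitions and cheap tests, while regions that are only mildly far receive more repetitions at a finer parameter.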


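First, to analyze the total number of queries and samples, since we can assume $q(\epsilon) \ge 1/\epsilon$ and $U(\epsilon) \ge 1/\epsilon$, we have $q_\delta(\epsilon')\epsilon'/\epsilon = O(q_\delta(\epsilon/2))$ and $U_\delta(\epsilon')\epsilon'/\epsilon = O(U_\delta(\epsilon/2))$ for $\epsilon' \ge \epsilon/2$. Thus, the total number of queries made is at most

$$\sum_{\epsilon'} q_\delta(\epsilon/2)\log(1/\epsilon) = O\left(q(\epsilon/2) \cdot \log^3\frac{1}{\epsilon}\right)$$

and the total number of unlabeled samples is at most

$$\sum_{\epsilon'} \frac{8N}{\epsilon}U_\delta(\epsilon/2)\log(1/\epsilon) = O\left(U(\epsilon/2)\frac{N}{\epsilon}\log^3\frac{1}{\epsilon}\right).$$

Next, to analyze correctness, if indeed $f \in \mathcal{P}$ then each call to a tester rejects with probability at most $\delta$ so the overall failure probability is at most $(\delta/\epsilon)\log^2(1/\epsilon) < 1/3$; thus it suffices to analyze the case that $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$.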
If $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$ then $\sum_{i : p_i \ge \epsilon/(4N)} p_i \cdot \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \ge 3\epsilon/4$. Moreover, for indices $i$ such that $p_i \ge \epsilon/(4N)$, with high probability Step 2 draws $U_\delta(\epsilon')$ samples, so we may assume for such indices the tester for $\mathcal{P}_i$ is indeed run in Step 3. Let $I = \{i : p_i \ge \epsilon/(4N) \text{ and } \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \ge \epsilon/2\}$. Thus, we have

$$\sum_{i \in I} p_i \cdot \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \ge \epsilon/4.$$

Let $I_{\epsilon'} = \{i \in I : \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \in [\epsilon', 2\epsilon']\}$. Bucketing the above summation by values $\epsilon'$ in this way implies that for some value $\epsilon' \in \{\epsilon/2, \epsilon, 2\epsilon, \ldots, 1/2\}$, we have:

$$\sum_{i \in I_{\epsilon'}} p_i \ge \epsilon/(8\epsilon'\log(1/\epsilon)).$$

This in turn implies that with probability at least 2/3, the run of the algorithm for this value of $\epsilon'$ will find such an $i$ and reject, as desired. □


2.10  Proofs  for  Testing  Dimensions 

2.10.1  Passive  Testing  Dimension  (proof  of  Theorem  2.15) 

Lower bound: By design, $d_{passive}$ is a lower bound on the number of examples needed for passive testing. In particular, if $D_S(\pi, \pi') \le 1/4$, and if the target is with probability 1/2 chosen from $\pi$ and with probability 1/2 chosen from $\pi'$, even the Bayes optimal tester will fail to identify the correct distribution with probability $\frac{1}{2}\sum_{y \in \{0,1\}^{|S|}} \min(\pi_S(y), \pi'_S(y)) = \frac{1}{2}(1 - D_S(\pi, \pi')) \ge 3/8$. The definition of $d_{passive}$ implies that there exist $\pi \in \Pi_0$, $\pi' \in \Pi_\epsilon$ such that $\Pr_S(D_S(\pi, \pi') \le 1/4) \ge 3/4$. Since $\pi'$ has a $1 - o(1)$ probability mass on functions that are $\epsilon$-far from $\mathcal{P}$, this implies that over random draws of $S$ and $f$, the overall failure probability of any tester is at least $(1 - o(1))(3/8)(3/4) > 1/4$. Thus, at least $d_{passive} + 1$ random labeled examples are required if we wish to guarantee error at most 1/4. This in turn implies $\Omega(d_{passive})$ examples are needed to guarantee error at most 1/3.




Upper bound: We now argue that $O(d_{passive})$ examples are sufficient for testing as well. Toward this end, consider the following natural testing game. The adversary chooses a function $f$ such that either $f \in \mathcal{P}$ or $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$. The tester picks a function $A$ that maps labeled samples of size $k$ to accept/reject. That is, $A$ is a deterministic passive testing algorithm. The payoff to the tester is the probability that $A$ is correct when $S$ is chosen iid from $D$ and labeled by $f$.

If $k > d_{passive}$ then (by definition of $d_{passive}$) we know that for any distribution $\pi$ over $f \in \mathcal{P}$ and any distribution $\pi'$ over $f$ that are $\epsilon$-far from $\mathcal{P}$, we have $\Pr_{S \sim D^k}(D_S(\pi, \pi') > 1/4) > 1/4$.

We now need to translate this into a statement about the value of the game. The key fact we can use is that if the adversary uses distribution $\alpha\pi + (1 - \alpha)\pi'$ (i.e., with probability $\alpha$ it chooses from $\pi$ and with probability $1 - \alpha$ it chooses from $\pi'$), then the Bayes optimal predictor has error exactly

$$\sum_y \min(\alpha\pi_S(y), (1 - \alpha)\pi'_S(y)) \le \max(\alpha, 1 - \alpha)\sum_y \min(\pi_S(y), \pi'_S(y)),$$

while

$$\sum_y \min(\pi_S(y), \pi'_S(y)) = 1 - (1/2)\sum_y |\pi_S(y) - \pi'_S(y)| = 1 - D_S(\pi, \pi'),$$

so that the Bayes risk is at most $\max(\alpha, 1 - \alpha)(1 - D_S(\pi, \pi'))$. Thus, for any $\alpha \in [7/16, 9/16]$, if $D_S(\pi, \pi') > 1/4$, the Bayes risk is less than $(9/16)(3/4) = 27/64$. Furthermore, any $\alpha \notin [7/16, 9/16]$ has Bayes risk at most 7/16. Thus, since $D_S(\pi, \pi') > 1/4$ with probability $> 1/4$ (and if $D_S(\pi, \pi') \le 1/4$ then the error probability of the Bayes optimal predictor is at most 1/2), for any mixed strategy of the adversary, the Bayes optimal predictor has risk less than $(1/4)(7/16) + (3/4)(1/2) = 31/64$.

Now, applying the minimax theorem we get that for $k = d_{passive} + 1$, there exists a mixed strategy $A$ for the tester such that for any function chosen by the adversary, the probability the tester is correct is at least $1/2 + \gamma$ for a constant $\gamma > 0$ (namely, 1/64). We can now boost the correctness probability using a constant-factor larger sample. Specifically, let $m = c \cdot (d_{passive} + 1)$ for some constant $c$, and consider a sample $S$ of size $m$. The tester simply partitions the sample $S$ into $c$ pieces, runs $A$ separately on each piece, and then takes majority vote. This gives us that $O(d_{passive})$ examples are sufficient for testing with any desired constant success probability in $(1/2, 1)$.
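To see how large the constant $c$ in the majority-vote step might need to be, the following sketch computes the exact probability that a strict majority of $c$ independent runs is correct when each run succeeds with probability $1/2 + 1/64$, and searches for the smallest odd $c$ reaching success probability 2/3. The target 2/3, the restriction to odd $c$, and the function name are choices made for this illustration.

```python
from math import comb


def majority_success(p, c):
    """Probability that a strict majority of c independent runs is correct,
    when each run is correct with probability p (c is assumed odd)."""
    return sum(
        comb(c, k) * p**k * (1 - p) ** (c - k)
        for k in range(c // 2 + 1, c + 1)
    )


p = 0.5 + 1 / 64  # per-run success probability 1/2 + gamma with gamma = 1/64
c = 1
while majority_success(p, c) < 2 / 3:
    c += 2        # keep c odd so that ties cannot occur
print(c, majority_success(p, c))
```

The required $c$ depends only on $\gamma$ and the target success probability, not on $d_{passive}$, which is why the boosted tester still uses $O(d_{passive})$ examples.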

2.10.2  Coarse  Active  Testing  Dimension  (proof  of  Theorem  2.17) 

Lower bound: First, we claim that any nonadaptive active testing algorithm that uses $\le d_{coarse}/c$ label requests must use more than $n^c$ unlabeled examples (and thus no algorithm can succeed using $o(d_{coarse})$ labels). To see this, suppose algorithm $A$ draws $n^c$ unlabeled examples. The number of subsets of size $d_{coarse}/c$ is at most $n^{d_{coarse}}/6$ (for $d_{coarse}/c \ge 3$). So, by definition of $d_{coarse}$ and the union bound, with probability at least 5/6, all such subsets $S$ satisfy the property that $D_S(\pi, \pi') < 1/4$. Therefore, for any sequence of such label requests, the labels observed will not be sufficient to reliably distinguish $\pi$ from $\pi'$. Adaptive active testers can potentially choose their next point to query based on labels observed so far, but the above immediately implies that even adaptive active testers cannot succeed using $o(\log(d_{coarse}))$ queries.

Upper bound: For the upper bound, we modify the argument from the passive testing dimension analysis as follows. We are given that for any distribution $\pi$ over $f \in \mathcal{P}$ and any distribution $\pi'$ over $f$ that are $\epsilon$-far from $\mathcal{P}$, for $k = d_{\mathrm{coarse}} + 1$, we have $\Pr_{S \sim D^k}(D_S(\pi, \pi') > 1/4) > n^{-k}$. Thus, we can sample $U \sim D^m$ with $m = \Theta(k \cdot n^k)$, and partition $U$ into subsamples $S_1, S_2, \ldots, S_{cn^k}$ of size $k$ each. With high probability, at least one of these subsamples $S_i$ will have $D_{S_i}(\pi, \pi') > 1/4$. We can thus simply examine each subsample, identify one such that $D_{S_i}(\pi, \pi') > 1/4$, and query the points in that sample. As in the proof for the passive bound, this implies that for any strategy for the adversary in the associated testing game, the best response has probability at least $1/2 + \gamma$ of success for some constant $\gamma > 0$. By the minimax theorem, this implies a testing strategy with success probability $1/2 + \gamma$, which can then be boosted to 2/3. The total number of label requests used in the process is only $O(d_{\mathrm{coarse}})$.

Note, however, that this strategy uses a number of unlabeled examples $\Omega(n^{d_{\mathrm{coarse}}+1})$. Thus, this only implies an active tester for $d_{\mathrm{coarse}} = O(1)$. Nonetheless, combining the upper and lower bounds yields Theorem 2.17.
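The counting step in the lower bound (the number of query sets of size $d_{\mathrm{coarse}}/c$ drawn from $n^c$ points is at most $n^{d_{\mathrm{coarse}}}/6$) can be sanity-checked with exact integer arithmetic. The following is only an illustrative sketch: the function names and the small parameter values are assumptions made here, not part of the original argument, and the check assumes $c$ divides $d_{\mathrm{coarse}}$.

```python
from math import comb


def num_query_sets(n: int, c: int, d_coarse: int) -> int:
    """Number of size-(d_coarse / c) subsets of a pool of n**c unlabeled examples."""
    assert d_coarse % c == 0, "illustration assumes c divides d_coarse"
    return comb(n ** c, d_coarse // c)


def counting_bound_holds(n: int, c: int, d_coarse: int) -> bool:
    """Check comb(n^c, d_coarse/c) <= n^d_coarse / 6 without floating point."""
    return 6 * num_query_sets(n, c, d_coarse) <= n ** d_coarse


# With d_coarse / c = 3 the bound holds; with d_coarse / c = 2 it can fail,
# which is why the text restricts to subsets of size at least 3.
print(counting_bound_holds(10, 2, 6), counting_bound_holds(10, 1, 2))
```

The bound rests on $\binom{N}{s} \leq N^s / s!$ with $s! \geq 6$ once $s \geq 3$.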

2.10.3  Active  Testing  Dimension  (proof  of  Theorem  2.19) 

Lower bound: For a given sample $U$, we can think of an adaptive active tester as a decision tree, defined based on which example it would request the label of next given that the previous requests have been answered in any given way. A tester making $k$ queries would yield a decision tree of depth $k$. By definition of $d_{\mathrm{active}}(u)$, with probability at least 3/4 (over choice of $U$), any such tester has error probability at least $(1/4)(1 - o(1))$ over the choice of $f$. Thus, the overall failure probability is at least $(3/4)(1/4)(1 - o(1)) > 1/8$.

Upper bound: We again consider the natural testing game. We are given that for any mixed strategy of the adversary with equal probability mass on functions in $\mathcal{P}$ and functions $\epsilon$-far from $\mathcal{P}$, the best response of the tester has expected payoff at least $(1/4)(3/4) + (3/4)(1/2) = 9/16$. This in turn implies that for any mixed strategy at all, the best response of the tester has expected payoff at least 33/64 (if the adversary puts more than 17/32 probability mass on either type of function, the tester can just guess that type with expected payoff at least 17/32, else it gets payoff at least $(1 - 1/16)(9/16) > 33/64$). By the minimax theorem, this implies existence of a randomized strategy for the tester with at least this payoff. We then boost correctness using $c \cdot u$ samples and $c \cdot d_{\mathrm{active}}(u)$ queries, running the tester $c$ times on disjoint samples and taking majority vote.
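The payoff arithmetic in the upper bound is easy to verify exactly. The sketch below recomputes the three quantities with rational arithmetic and adds a small helper for the majority-vote boosting step; the variable names and the helper `majority_success` are introduced here purely for illustration and are not taken from the original analysis.

```python
from fractions import Fraction as F
from math import comb

# Payoff of the best response against an adversary with equal mass on both types.
balanced = F(1, 4) * F(3, 4) + F(3, 4) * F(1, 2)
# Payoff from simply guessing the heavier type when it has mass above 17/32.
skew_guess = F(17, 32)
# Otherwise at least 15/16 of the adversary's mass forms a balanced strategy.
mixed = (1 - F(1, 16)) * balanced
target = F(33, 64)


def majority_success(p, c):
    """Probability that a strict majority of c independent runs is correct (c odd)."""
    return sum(comb(c, k) * p**k * (1 - p) ** (c - k) for k in range(c // 2 + 1, c + 1))


print(balanced, mixed, mixed > target, skew_guess >= target)
```

Both branches of the case analysis clear the 33/64 threshold, and `majority_success` shows how repeating a tester with success probability above 1/2 pushes the success probability upward.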

2.10.4  Lower  Bounds  for  Testing  LTFs  (proof  of  Theorem  2.20) 

We complete the proofs for the lower bounds on the query complexity for testing linear threshold functions in the active and passive models. This proof has three parts. First, in Section 2.10.4, we introduce some preliminary (technical) results that will be used to prove the lower bounds on the passive and coarse dimensions of testing LTFs. In Section 2.10.4, we introduce some more preliminary results regarding random matrices that we will use to bound the active dimension of the class. Finally, in Section 2.10.4, we put it all together and complete the proof of Theorem 2.20.

Preliminaries  for  dpassive  and  dcoarse 

Fix any $K$. Let the dataset $X = \{x_1, x_2, \ldots, x_K\}$ be sampled i.i.d. according to the uniform distribution on $\{-1, +1\}^n$ and let $X \in \mathbb{R}^{K \times n}$ be the corresponding data matrix.

Suppose $w \sim \mathcal{N}(0, I_{n \times n})$. We let

\[ z = Xw, \]

and note that the conditional distribution of $z$ given $X$ is normal with mean 0 and ($X$-dependent) covariance matrix, which we denote by $\Sigma$. Further applying a threshold function to $z$ gives $y$ as the predicted label vector of an LTF.

Lemma 2.25. For any matrix $B$, $\log(\det(B)) = \mathrm{Tr}(\log(B))$, where $\log(B)$ is the matrix logarithm of $B$.

Proof. From [Higham, 2008], we know that every eigenvalue $\lambda$ of $A$ corresponds to the eigenvalue $e^{\lambda}$ of $\exp(A)$, and thus

\[ \det(\exp(A)) = \exp(\mathrm{Tr}(A)) \tag{2.1} \]

where $\exp(A)$ is the matrix exponential of $A$. Taking the logarithm of both sides of (2.1), we get

\[ \log(\det(\exp(A))) = \mathrm{Tr}(A) \tag{2.2} \]

Let $B = \exp(A)$ (thus $A = \log(B)$). Then (2.2) can be rewritten as $\log(\det(B)) = \mathrm{Tr}(\log B)$. □
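As a quick numerical illustration of identity (2.1), the sketch below computes the matrix exponential of a small matrix with a truncated power series and compares its determinant against $\exp(\mathrm{Tr}(A))$. The helper names (`matmul`, `expm`, `det2`) and the example matrix are assumptions chosen here for illustration; the truncated series is adequate only for matrices of small norm.

```python
import math


def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def expm(A, terms=40):
    """Matrix exponential via the truncated series sum_k A^k / k!."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, A)]  # A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


A = [[0.3, -0.2], [0.5, 0.1]]
lhs = det2(expm(A))                   # det(exp(A))
rhs = math.exp(A[0][0] + A[1][1])     # exp(Tr(A))
print(abs(lhs - rhs) < 1e-9)
```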

Lemma 2.26. For sufficiently large $n$, and a value $K = \Omega\left(\sqrt{n / \log(K/\delta)}\right)$, with probability at least $1 - \delta$ (over $X$),

\[ \left\| P_{(z/\sqrt{n}) \mid X} - \mathcal{N}(0, I) \right\| \leq 1/4. \]
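Proof. Let $l$ be the feature index. For a pair $x_i$ and $x_j$, by Hoeffding's inequality,

\[ P\left( \left| \, |\{l : x_{il} = x_{jl}\}| - \frac{n}{2} \, \right| > \sqrt{\frac{n \log\frac{2}{\delta}}{2}} \right) \leq \delta. \]

Thus, with probability $1 - \delta$,

\[
\begin{aligned}
x_i^\top x_j &= |\{l : x_{il} = x_{jl}\}| - |\{l : x_{il} \neq x_{jl}\}| \\
&= 2\,|\{l : x_{il} = x_{jl}\}| - n \in \left[ -2\sqrt{\frac{n \log\frac{2}{\delta}}{2}},\; 2\sqrt{\frac{n \log\frac{2}{\delta}}{2}} \right].
\end{aligned}
\]

By the union bound,

\[ P\left( \exists\, i, j \text{ such that } x_i^\top x_j \notin \left[ -\sqrt{2n \log\frac{2K^2}{\delta}},\; \sqrt{2n \log\frac{2K^2}{\delta}} \right] \right) \leq K^2 \frac{\delta}{K^2} = \delta. \]

For the remainder of the proof we suppose the (probability $1 - \delta$) event

\[ \forall\, i \neq j: \quad x_i^\top x_j \in \left[ -\sqrt{2n \log(2K^2/\delta)},\; \sqrt{2n \log(2K^2/\delta)} \right] \tag{2.3} \]

occurs.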


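\[
\begin{aligned}
\mathrm{Cov}(z_i/\sqrt{n},\, z_j/\sqrt{n} \mid X)
&= \frac{1}{n} E\left[ \Big( \sum_{l=1}^{n} w_l x_{il} \Big) \Big( \sum_{l=1}^{n} w_l x_{jl} \Big) \,\Big|\, X \right] \\
&= \frac{1}{n} E\left[ \sum_{l,m=1,1}^{n,n} w_l w_m x_{il} x_{jm} \,\Big|\, X \right] \\
&= \frac{1}{n} E\left[ \sum_{l} w_l^2 x_{il} x_{jl} \,\Big|\, X \right] \\
&= \frac{1}{n} E\left[ \sum_{l} x_{il} x_{jl} \,\Big|\, X \right] \\
&= \frac{1}{n} x_i^\top x_j \in \left[ -\sqrt{\frac{2\log(2K^2/\delta)}{n}},\; \sqrt{\frac{2\log(2K^2/\delta)}{n}} \right],
\end{aligned}
\]

because $E[w_l w_m] = 0$ (for $l \neq m$) and $E[w_l^2] = 1$. Let $\beta = \sqrt{\frac{2\log(2K^2/\delta)}{n}}$. Thus $\Sigma$ is a $K \times K$ matrix, with $\Sigma_{ii} = 1$ for $i = 1, \ldots, K$ and $\Sigma_{ij} \in [-\beta, \beta]$ for all $i \neq j$.

Let $P_1 = \mathcal{N}(0, \Sigma_{K \times K})$ and $P_2 = \mathcal{N}(0, I_{K \times K})$, with densities

\[ p_1(z) = \frac{1}{\sqrt{(2\pi)^K \det(\Sigma)}} \exp\left( -\frac{1}{2} z^\top \Sigma^{-1} z \right) \]

and

\[ p_2(z) = \frac{1}{\sqrt{(2\pi)^K}} \exp\left( -\frac{1}{2} z^\top z \right). \]

Then the $L_1$ distance between the two distributions $P_1$ and $P_2$ satisfies

\[ \int |dP_2 - dP_1| \leq 2\sqrt{K(P_1, P_2)} = 2\sqrt{(1/2) \log\det(\Sigma)}, \]

where this last equality is by [Davis and Dhillon, 2006]. By Lemma 2.25, $\log(\det(\Sigma)) = \mathrm{Tr}(\log(\Sigma))$. Write $A = \Sigma - I$. By the Taylor series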

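\[ \log(I + A) = -\sum_{i=1}^{\infty} \frac{1}{i} \big( I - (I + A) \big)^i = -\sum_{i=1}^{\infty} \frac{1}{i} (-A)^i. \]

Thus

\[ \mathrm{Tr}(\log(I + A)) = -\sum_{i=1}^{\infty} \frac{1}{i} \mathrm{Tr}\big( (-A)^i \big). \tag{2.4} \]

Every entry in $A^i$ can be expressed as a sum of at most $K^{i-1}$ terms, each of which can be expressed as a product of exactly $i$ entries from $A$. Thus, every entry in $A^i$ is in the range $[-K^{i-1}\beta^i, K^{i-1}\beta^i]$. This means $\mathrm{Tr}(A^i) \leq K^i \beta^i$. Therefore, if $K\beta < 1/2$, since $\mathrm{Tr}(A) = 0$, the expansion of $\mathrm{Tr}(\log(I + A)) \leq \sum_{i=2}^{\infty} K^i \beta^i = O(K^2\beta^2) = O\left( K^2 \frac{\log(K/\delta)}{n} \right)$.

In particular, for some $K = \Omega\left(\sqrt{n / \log(K/\delta)}\right)$, $\mathrm{Tr}(\log(I + A))$ is bounded by the appropriate constant to obtain the stated result. □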
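The series expansion (2.4) and the entrywise bound used above can be checked numerically on a small example. The following sketch is illustrative only: the helper names, the $4 \times 4$ example matrix, and the choice $\beta = 0.1$ are assumptions made here. It compares the truncated series for $\mathrm{Tr}(\log(I + A))$ against $\log\det(I + A)$ (the two agree by Lemma 2.25) and confirms that its magnitude stays below the geometric sum $\sum_{i \geq 2} (K\beta)^i$.

```python
import math


def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[p][c] == 0.0:
            return 0.0
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
    return d


def trace_log_series(A, terms=60):
    """Truncation of -sum_{i>=1} (1/i) Tr((-A)^i), the right-hand side of (2.4)."""
    n = len(A)
    neg_a = [[-x for x in row] for row in A]
    power = [[float(i == j) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(1, terms + 1):
        power = matmul(power, neg_a)               # now equals (-A)^i
        total -= sum(power[j][j] for j in range(n)) / i
    return total


K, beta = 4, 0.1
# Symmetric, zero diagonal, off-diagonal entries in [-beta, beta]; K * beta < 1/2.
A = [[0.0, 0.1, -0.05, 0.02],
     [0.1, 0.0, 0.03, -0.1],
     [-0.05, 0.03, 0.0, 0.07],
     [0.02, -0.1, 0.07, 0.0]]
I_plus_A = [[A[i][j] + float(i == j) for j in range(K)] for i in range(K)]
series = trace_log_series(A)
geometric_bound = (K * beta) ** 2 / (1 - K * beta)   # sum_{i>=2} (K*beta)^i
print(abs(series - math.log(det(I_plus_A))) < 1e-9, abs(series) <= geometric_bound)
```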

Preliminaries for dactive

Given an $n \times m$ matrix $A$ with real entries $\{a_{i,j}\}_{i \in [n], j \in [m]}$, the adjoint (or transpose; the two are equivalent since $A$ contains only real values) of $A$ is the $m \times n$ matrix $A^*$ whose $(i, j)$-th entry equals $a_{j,i}$. Let us write $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ to denote the eigenvalues of $\sqrt{A^*A}$. These values are the singular values of $A$. The matrix $A^*A$ is positive semidefinite, so the singular values of $A$ are all non-negative. We write $\lambda_{\max}(A) = \lambda_1$ and $\lambda_{\min}(A) = \lambda_m$ to represent its largest and smallest singular values. Finally, the induced norm (or operator norm) of $A$ is

\[ \|A\| = \max_{x \in \mathbb{R}^m \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \max_{x \in \mathbb{R}^m : \|x\|_2^2 = 1} \|Ax\|_2. \]
For more details on these definitions, see any standard linear algebra text (e.g., [Shilov, 1977]).
We  will  also  use  the  following  strong  concentration  bounds  on  the  singular  values  of  random 
matrices. 

Lemma 2.27 (See [Vershynin, 2012, Cor. 5.35]). Let $A$ be an $n \times m$ matrix whose entries are independent standard normal random variables. Then for any $t > 0$, the singular values of $A$ satisfy

\[ \sqrt{n} - \sqrt{m} - t \leq \lambda_{\min}(A) \leq \lambda_{\max}(A) \leq \sqrt{n} + \sqrt{m} + t \tag{2.5} \]

with probability at least $1 - 2e^{-t^2/2}$.

The  proof  of  this  lemma  follows  from  Talagrand’s  inequality  and  Gordon’s  Theorem  for 
Gaussian  matrices.  See  [Vershynin,  2012]  for  the  details.  The  lemma  implies  the  following 
corollary  which  we  will  use  in  the  proof  of  our  theorem. 

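Corollary 2.28. Let $A$ be an $n \times m$ matrix whose entries are independent standard normal random variables. For any $0 < t < \sqrt{n} - \sqrt{m}$, the $m \times m$ matrix $\frac{1}{n} A^*A$ satisfies both inequalities

\[ \left\| \frac{1}{n} A^*A - I \right\| \leq 3\,\frac{\sqrt{m} + t}{\sqrt{n}} \quad \text{and} \quad \det\left( \frac{1}{n} A^*A \right) \geq e^{-m \left( \frac{(\sqrt{m} + t)^2}{n} + 2\,\frac{\sqrt{m} + t}{\sqrt{n}} \right)} \tag{2.6} \]

with probability at least $1 - 2e^{-t^2/2}$.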


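Proof. When there exists $0 < z < 1$ such that $1 - z \leq \frac{1}{\sqrt{n}} \lambda_{\max}(A) \leq 1 + z$, the identity $\frac{1}{\sqrt{n}} \lambda_{\max}(A) = \left\| \frac{1}{\sqrt{n}} A \right\| = \max_{\|x\|_2^2 = 1} \left\| \frac{1}{\sqrt{n}} Ax \right\|_2$ implies that

\[ 1 - 2z \leq (1 - z)^2 \leq \max_{\|x\|_2^2 = 1} \left\| \frac{1}{\sqrt{n}} Ax \right\|_2^2 \leq (1 + z)^2 \leq 1 + 3z. \]

These inequalities and the identity $\left\| \frac{1}{n} A^*A - I \right\| = \max_{\|x\|_2^2 = 1} \left\| \frac{1}{\sqrt{n}} Ax \right\|_2^2 - 1$ imply that $-2z \leq \left\| \frac{1}{n} A^*A - I \right\| \leq 3z$. Fixing $z = \frac{\sqrt{m} + t}{\sqrt{n}}$ and applying Lemma 2.27 completes the proof of the first inequality.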

Recall that $\lambda_1 \geq \cdots \geq \lambda_m$ are the eigenvalues of $\sqrt{A^*A}$. Then

\[ \det\left( \frac{1}{n} A^*A \right) = \frac{\det\left( \sqrt{A^*A} \right)^2}{n^m} = \frac{(\lambda_1 \cdots \lambda_m)^2}{n^m} \geq \left( \frac{\lambda_m^2}{n} \right)^m = \left( \frac{\lambda_{\min}(A)^2}{n} \right)^m. \]

Lemma 2.27 and the elementary inequality $1 + x \leq e^x$ complete the proof of the second inequality. □

Proof  of  Theorem  2.20 

Theorem 2.20 (Restated). For linear threshold functions under the uniform distribution on $\{-1, 1\}^n$, $d_{\mathrm{passive}} = \Omega(\sqrt{n / \log(n)})$ and $d_{\mathrm{active}} = \Omega((n / \log(n))^{1/3})$.

Proof. Let $K$ be as in Lemma 2.26 for $\delta = 1/4$. Let $D = \{(x_1, y_1), \ldots, (x_K, y_K)\}$ denote the sequence of labeled data points under the random LTF based on $w$. Furthermore, let $D' = \{(x_1, y'_1), \ldots, (x_K, y'_K)\}$ denote the sequence of labeled data points under a target function that assigns an independent random label to each data point. Also let $z_i = (1/\sqrt{n})\, w^\top x_i$, and let $z' \sim \mathcal{N}(0, I_{K \times K})$. Let $E = \{(x_1, z_1), \ldots, (x_K, z_K)\}$ and $E' = \{(x_1, z'_1), \ldots, (x_K, z'_K)\}$. Note that we can think of $y_i$ and $y'_i$ as being functions of $z_i$ and $z'_i$, respectively. Thus, letting $X = \{x_1, \ldots, x_K\}$, by Lemma 2.26, with probability at least 3/4,

\[ \| P_{D|X} - P_{D'|X} \| \leq \| P_{E|X} - P_{E'|X} \| \leq 1/4. \]

This suffices for the claim that $d_{\mathrm{passive}} = \Omega(K) = \Omega(\sqrt{n / \log(n)})$.

Next we turn to the lower bound on $d_{\mathrm{active}}$. Let us now introduce two distributions $\mathcal{D}_{\mathrm{yes}}$ and $\mathcal{D}_{\mathrm{no}}$ over linear threshold functions and functions that (with high probability) are far from linear threshold functions, respectively. We draw a function $f$ from $\mathcal{D}_{\mathrm{yes}}$ by first drawing a vector $w \sim \mathcal{N}(0, I_{n \times n})$ from the $n$-dimensional standard normal distribution. We then define $f : x \mapsto \mathrm{sgn}(\frac{1}{\sqrt{n}} x \cdot w)$. To draw a function $g$ from $\mathcal{D}_{\mathrm{no}}$, we define $g(x) = \mathrm{sgn}(y_x)$ where each $y_x$ variable is drawn independently from the standard normal distribution $\mathcal{N}(0, 1)$.

Let $X \in \mathbb{R}^{n \times q}$ be a random matrix obtained by drawing $q$ vectors from the $n$-dimensional normal distribution $\mathcal{N}(0, I_{n \times n})$ and setting these vectors to be the columns of $X$. Equivalently, $X$ is the random matrix whose entries are independent standard normal variables. When we view $X$ as a set of $q$ queries to a function $f \sim \mathcal{D}_{\mathrm{yes}}$ or a function $g \sim \mathcal{D}_{\mathrm{no}}$, we get $f(X) = \mathrm{sgn}(\frac{1}{\sqrt{n}} Xw)$ and $g(X) = \mathrm{sgn}(y_X)$. Note that $\frac{1}{\sqrt{n}} Xw \sim \mathcal{N}(0, \frac{1}{n} X^*X)$ and $y_X \sim \mathcal{N}(0, I_{q \times q})$. To apply Lemma 2.21 it suffices to show that the ratio of the pdfs for both these random variables is bounded by $\frac{6}{5}$ for all but $\frac{1}{5}$ of the probability mass.

The pdf $p : \mathbb{R}^q \to \mathbb{R}$ of a $q$-dimensional random vector from the distribution $\mathcal{N}_{q \times q}(0, \Sigma)$ is

\[ p(x) = (2\pi)^{-\frac{q}{2}} \det(\Sigma)^{-\frac{1}{2}} e^{-\frac{1}{2} x^\top \Sigma^{-1} x}. \]

Therefore, the ratio function $r : \mathbb{R}^q \to \mathbb{R}$ between the pdfs of $\frac{1}{\sqrt{n}} Xw$ and of $y_X$ is

\[ r(x) = \det\left( \tfrac{1}{n} X^*X \right)^{-\frac{1}{2}} e^{\frac{1}{2} x^\top \left( \left( \frac{1}{n} X^*X \right)^{-1} - I \right) x}. \]

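Note that

\[ x^\top \left( \left( \tfrac{1}{n} X^*X \right)^{-1} - I \right) x \leq \left\| \left( \tfrac{1}{n} X^*X \right)^{-1} - I \right\| \, \|x\|_2^2 = \left\| \tfrac{1}{n} X^*X - I \right\| \, \|x\|_2^2, \]

so by Lemma 2.27 with probability at least $1 - 2e^{-t^2/2}$ we have

\[ r(x) \leq e^{\frac{q}{2} \left( \frac{(\sqrt{q} + t)^2}{n} + 2\,\frac{\sqrt{q} + t}{\sqrt{n}} \right) + 3\,\frac{\sqrt{q} + t}{\sqrt{n}} \|x\|_2^2}. \]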

By a union bound, for $U \sim \mathcal{N}(0, I_{n \times n})^u$, $u \in \mathbb{N}$ with $u \geq q$, the above inequality for $r(x)$ is true for all subsets of $U$ of size $q$, with probability at least $1 - u^q\, 2e^{-t^2/2}$. Fix $q = n^{1/3} / (50 (\ln(u))^{1/3})$ and $t = 2\sqrt{q \ln(u)}$. Then $u^q\, 2e^{-t^2/2} \leq 2u^{-q}$, which is $< 1/4$ for any sufficiently large $n$. When $\|x\|_2^2 \leq 3q$ then for large $n$, $r(x) \leq e^{74/625} < \frac{6}{5}$. To complete the proof, it suffices to show that when $x \sim \mathcal{N}(0, I_{q \times q})$, the probability that $\|x\|_2^2 > 3q$ is at most $\frac{1}{5} 2^{-q}$. The random variable $\|x\|_2^2$ has a $\chi^2$ distribution with $q$ degrees of freedom and expected value $E\|x\|_2^2 = \sum_{i=1}^{q} E x_i^2 = q$. Standard concentration bounds for $\chi^2$ variables imply that

\[ \Pr_{x \sim \mathcal{N}(0, I_{q \times q})} \left[ \|x\|_2^2 > 3q \right] \leq e^{-\frac{4}{3} q} < \frac{1}{5} 2^{-q}, \]

as we wanted to show. Thus, Lemma 2.21 implies $\mathrm{err}^*(\mathrm{DT}_q, \mathrm{Fair}(\pi, \pi', U)) > 1/4$ holds whenever this $r(x)$ inequality is satisfied for all subsets of $U$ of size $q$; we have shown this happens with probability greater than 3/4, so we must have $d_{\mathrm{active}} \geq q$. □
If we are only interested in bounding $d_{\mathrm{coarse}}$, the proof can be somewhat simplified. Specifically, taking $\delta = n^{-K}$ in Lemma 2.26 implies that with probability at least $1 - n^{-K}$,

\[ \| P_{D|X} - P_{D'|X} \| \leq \| P_{E|X} - P_{E'|X} \| \leq 1/4, \]

which suffices for the claim that $d_{\mathrm{coarse}} = \Omega(K)$, where $K = \Omega(\sqrt{n / (K \log(n))})$: in particular,

\[ d_{\mathrm{coarse}} = \Omega((n / \log(n))^{1/3}). \]
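The passage from the implicit condition on $K$ to the explicit exponent $1/3$ can be unpacked as follows. This is only a sketch: the constant $c > 0$ is a placeholder introduced here for the constant hidden in the $\Omega(\cdot)$ of Lemma 2.26, and $\log(K/\delta)$ is simplified using $\delta = n^{-K}$.

```latex
\begin{align*}
\log(K/\delta) &= \log K + K \log n = \Theta(K \log n), \\
K \geq c \sqrt{\frac{n}{K \log n}}
  &\iff K^3 \geq c^2\, \frac{n}{\log n}
  \iff K \geq c^{2/3} \left( \frac{n}{\log n} \right)^{1/3}.
\end{align*}
```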

2.11  Testing  Semi-Supervised  Learning  Assumptions 

We now consider testing of common assumptions made in semi-supervised learning [Chapelle, Schölkopf, and Zien, 2006], where unlabeled data, together with assumptions about how the target function and data distribution relate, are used to constrain the search space. As mentioned in Section 2.4, one such assumption we can test using our generic disjoint-unions tester is the cluster assumption, that if data lies in $N$ identifiable clusters, then points in the same cluster should have the same label. We can in fact achieve the following tighter bounds:

Theorem 2.29. We can test the cluster assumption with active testing using $O(N/\epsilon)$ unlabeled examples and $O(1/\epsilon)$ queries.

Proof. Let $p_{i1}$ and $p_{i0}$ denote the probability mass on positive examples and negative examples respectively in cluster $i$, so $p_{i1} + p_{i0}$ is the total probability mass of cluster $i$. Then $\mathrm{dist}(f, \mathcal{P}) = \sum_i \min(p_{i1}, p_{i0})$. Thus, a simple tester is to draw a random example $x$, draw a random example $y$ from $x$'s cluster, and check if $f(x) = f(y)$. Notice that with probability exactly $\mathrm{dist}(f, \mathcal{P})$, point $x$ is in the minority class of its own cluster, and conditioned on this event, with probability at least 1/2, point $y$ will have a different label. It thus suffices to repeat this process $O(1/\epsilon)$ times. One complication is that as stated, this process might require a large unlabeled sample, especially if $x$ belongs to a cluster $i$ such that $p_{i0} + p_{i1}$ is small, so that many draws are needed to find a point $y$ in $x$'s cluster. To achieve the given unlabeled sample bound, we initially draw an unlabeled sample of size $O(N/\epsilon)$ and simply perform the above test on the uniform distribution $U$ over that sample, with distance parameter $\epsilon/2$. Standard sample complexity bounds [Vapnik, 1998] imply that $O(N/\epsilon)$ unlabeled points are sufficient so that if $\mathrm{dist}_D(f, \mathcal{P}) \geq \epsilon$ then with high probability, $\mathrm{dist}_U(f, \mathcal{P}) \geq \epsilon/2$. □
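The tester in this proof can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the function name `cluster_tester`, the callables `cluster_of` and `label_of`, and the toy data are all hypothetical, and the unlabeled sample is represented as a plain list over which the tester draws uniformly.

```python
import random


def cluster_tester(points, cluster_of, label_of, trials, rng):
    """Accept (True) unless a sampled same-cluster pair is seen with different labels.

    points     : the unlabeled sample, as a list
    cluster_of : hypothetical function mapping a point to its cluster id
    label_of   : hypothetical label oracle; each call is one label request
    trials     : number of repetitions, on the order of 1/epsilon
    """
    by_cluster = {}
    for p in points:
        by_cluster.setdefault(cluster_of(p), []).append(p)
    for _ in range(trials):
        x = rng.choice(points)                      # random example from the sample
        y = rng.choice(by_cluster[cluster_of(x)])   # random example from x's cluster
        if label_of(x) != label_of(y):
            return False                            # violation witnessed: reject
    return True


rng = random.Random(0)
sample = list(range(100))              # toy sample: ten clusters of ten points each
consistent = lambda p: (p // 10) % 2   # label depends only on the cluster
violating = lambda p: p % 2            # every cluster is split evenly
print(cluster_tester(sample, lambda p: p // 10, consistent, 50, rng))
```

The tester has one-sided error: a function satisfying the cluster assumption is never rejected, while a function far from it is rejected once a disagreeing pair is drawn.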

We now consider the property of a function having a large margin with respect to the underlying distribution: that is, the distribution $D$ and target $f$ are such that any point in the support of $D|_{f=1}$ is at distance $\gamma$ or more from any point in the support of $D|_{f=0}$. This is a common property assumed in graph-based and nearest-neighbor-style semi-supervised learning algorithms [Chapelle, Schölkopf, and Zien, 2006]. Note that we are not additionally requiring the target to be a linear separator or have any special functional form. For scaling, we assume that points lie in the unit ball in $\mathbb{R}^d$, where we view $d$ as constant and $1/\gamma$ as our asymptotic parameter.8 Since we are not assuming any specific functional form for the target, the number of labeled examples needed for learning could be as large as $\Omega(1/\gamma^d)$, by having a distribution with support over $\Omega(1/\gamma^d)$ points that are all at distance $\gamma$ from each other (and therefore can be labeled arbitrarily). Furthermore, passive testing would require $\Omega(1/\gamma^{d/2})$ samples as this specific case encodes the cluster-assumption setting with $N = \Omega(1/\gamma^d)$ clusters. We will be able to perform active testing using only $O(1/\epsilon)$ label requests.

First, one distinction between this and other properties we have been discussing is that it is a property of the relation between the target function $f$ and the distribution $D$; i.e., of the combined distribution $D_f = (D, f)$ over labeled examples. As a result, the natural notion of distance to this property is in terms of the variation distance of $D_f$ to the closest $D_*$ satisfying the property.9 Second, we will have to also allow some amount of slack on the $\gamma$ parameter as

8 Alternatively, points could lie in a $d$-dimensional manifold in some higher-dimensional ambient space, where the property is defined with respect to the manifold, and we have sufficient unlabeled data to "unroll" the manifold using existing methods [Chapelle, Schölkopf, and Zien, 2006, Roweis and Saul, 2000, Tenenbaum, Silva, and Langford, 2000].

9 As a simple example illustrating the issue, consider $X = [0, 1]$, a target $f$ that is negative on $[0, 1/2)$ and positive on $[1/2, 1]$, and a distribution $D$ that is uniform but where the region $[1/2, 1/2 + \gamma]$ is downweighted to
well. Specifically, our tester will distinguish the case that $D_f$ indeed has margin $\gamma$ from the case that $D_f$ is $\epsilon$-far from having margin $\gamma'$ where $\gamma' = \gamma(1 - 1/c)$ for some constant $c > 1$; e.g., think of $\gamma' = \gamma/2$. This slack can also be seen to be necessary (see discussion following the proof of Theorem 2.13). In particular, we have the following.

Theorem 2.13 (Restated). For any $\gamma$, $\gamma' = \gamma(1 - 1/c)$ for constant $c > 1$, for data in the unit ball in $\mathbb{R}^d$ for constant $d$, we can distinguish the case that $D_f$ has margin $\gamma$ from the case that $D_f$ is $\epsilon$-far from margin $\gamma'$ using Active Testing with $O(1/(\gamma^{2d}\epsilon^2))$ unlabeled examples and $O(1/\epsilon)$ label requests.

Proof. First, partition the input space $X$ (the unit ball in $\mathbb{R}^d$) into regions $R_1, R_2, \ldots, R_N$ of diameter at most $\gamma/(2c)$. By a standard volume argument, this can be done using $N = O(1/\gamma^d)$ regions (absorbing "$c$" into the $O()$). Next, we run the cluster-property tester on these $N$ regions, with distance parameter $\epsilon/4$. Clearly, if the cluster-tester rejects, then we can reject as well. Thus, we may assume below that the total impurity within individual regions is at most $\epsilon/4$.

Now, consider the following weighted graph $G_\gamma$. We have $N$ vertices, one for each of the $N$ regions. We have an edge $(i, j)$ between regions $R_i$ and $R_j$ if $\mathrm{diam}(R_i \cup R_j) < \gamma$. We define the weight $w(i, j)$ of this edge to be $\min(D[R_i], D[R_j])$ where $D[R]$ is the probability mass in $R$ under distribution $D$. Notice that if there is no edge between region $R_i$ and $R_j$, then by the triangle inequality every point in $R_i$ must be at distance at least $\gamma'$ from every point in $R_j$. Also, note that each vertex has degree $O(c^d) = O(1)$, so the total weight over all edges is $O(1)$. Finally, note that while algorithmically we do not know the edge weights precisely, we can estimate all edge weights to $\pm\epsilon/(4M)$, where $M = O(N)$ is the total number of edges, using the unlabeled sample size bounds given in the Theorem statement. Let $\hat{w}(i, j)$ denote the estimated weight of edge $(i, j)$.

Let $E_{\mathrm{witness}}$ be the set of edges $(i, j)$ such that one endpoint is majority positive and one is

have total probability mass only $1/2^n$. Such a $D_f$ is $1/2^n$-close to the property under variation distance, but would be nearly 1/2-far from the property if the only operation allowed were to change the function $f$.
majority negative. Note that if $D_f$ satisfies the $\gamma$-margin property, then every edge in $E_{\mathrm{witness}}$ has weight 0. On the other hand, if $D_f$ is $\epsilon$-far from the $\gamma'$-margin property, then the total weight of edges in $E_{\mathrm{witness}}$ is at least $3\epsilon/4$. The reason is that otherwise one could convert $D_f$ to $D'_f$ satisfying the margin condition by zeroing out the probability mass in the lightest endpoint of every edge $(i, j) \in E_{\mathrm{witness}}$, and then for each vertex, zeroing out the probability mass of points in the minority label of that vertex. (Then, renormalize to have total probability 1.) The first step moves distance at most $3\epsilon/4$ and the second step moves distance at most $\epsilon/4$ by our assumption of success of the cluster-tester. Finally, if the true total weight of edges in $E_{\mathrm{witness}}$ is at least $3\epsilon/4$ then the sum of their estimated weights $\hat{w}(i, j)$ is at least $\epsilon/2$. This implies we can perform our test as follows. For $O(1/\epsilon)$ steps, do:

1. Choose an edge $(i, j)$ with probability proportional to $\hat{w}(i, j)$.

2. Request the label for a random $x \in R_i$ and $y \in R_j$. If the two labels disagree, then reject.

If $D_f$ is $\epsilon$-far from the $\gamma'$-margin property, then each step has probability $\hat{w}(E_{\mathrm{witness}})/\hat{w}(E) = \Omega(\epsilon)$ of choosing a witness edge, and conditioned on choosing a witness edge has probability at least 1/2 of detecting a violation. Thus, overall, we can test using $O(1/\epsilon)$ labeled examples and $O(1/(\gamma^{2d}\epsilon^2))$ unlabeled examples. □
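The two-step loop at the end of this proof can be sketched as follows. This is a hedged illustration, not the thesis's implementation: the function name `margin_edge_test`, the data layout (a dict from region index to sampled points, a list of edges, and a dict of estimated weights), and the one-dimensional toy data are assumptions made here. The partition into regions, the construction of the graph, and the preliminary cluster test are taken as already done.

```python
import random


def margin_edge_test(regions, edges, est_weight, label_of, steps, rng):
    """Accept (True) unless a sampled edge yields two points with different labels.

    regions    : dict mapping region index -> list of sampled points in that region
    edges      : list of (i, j) pairs of regions whose union has diameter below gamma
    est_weight : dict mapping each edge to its estimated weight
    label_of   : hypothetical label oracle; each call is one label request
    steps      : number of repetitions, on the order of 1/epsilon
    """
    usable = [e for e in edges
              if est_weight[e] > 0 and regions[e[0]] and regions[e[1]]]
    if not usable:
        return True
    weights = [est_weight[e] for e in usable]
    for _ in range(steps):
        i, j = rng.choices(usable, weights=weights, k=1)[0]   # step 1
        x = rng.choice(regions[i])                            # step 2
        y = rng.choice(regions[j])
        if label_of(x) != label_of(y):
            return False
    return True


rng = random.Random(0)
regions = {0: [0.05, 0.10], 1: [0.15, 0.20], 2: [0.80, 0.90]}
edges = [(0, 1)]                    # only regions 0 and 1 are close to each other
est_weight = {(0, 1): 0.4}
wide_margin = lambda x: int(x < 0.5)     # regions 0 and 1 share a label
no_margin = lambda x: int(x < 0.12)      # label flips between regions 0 and 1
print(margin_edge_test(regions, edges, est_weight, wide_margin, 20, rng),
      margin_edge_test(regions, edges, est_weight, no_margin, 20, rng))
```

Sampling edges in proportion to their estimated weight is what keeps the number of label requests at $O(1/\epsilon)$: each step lands on a witness edge with probability $\Omega(\epsilon)$.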

On the necessity of slack in testing the margin assumption: Consider an instance space $X = [0, 1]^2$ and two distributions over labeled examples $D_1$ and $D_2$. Distribution $D_1$ has probability mass $1/2^{n+1}$ on positive examples at location $(0, i/2^n)$ and negative examples at $(\gamma', i/2^n)$ for each $i = 1, 2, \ldots, 2^n$, for $\gamma' = \gamma(1 - 1/2^{2n})$. Notice that $D_1$ is 1/2-far from the $\gamma$-margin property because there is a matching between points in the support of $D_1|_{f=1}$ and points in the support of $D_1|_{f=0}$ where the matched points have distance less than $\gamma$. On the other hand, for each $i = 1, 2, \ldots, 2^n$, distribution $D_2$ has probability mass $1/2^n$ at either a positive point $(0, i/2^n)$ or a negative point $(\gamma', i/2^n)$, chosen at random, but zero probability mass at the other location. Distribution $D_2$ satisfies the $\gamma$-margin property, and yet $D_1$ and $D_2$ cannot be distinguished using a polynomial number of unlabeled examples.
Chapter 3


Testing Piecewise Real-Valued Functions


Abstract

This chapter extends the model of the previous chapter to the setting of testing properties of real-valued functions. Specifically, it establishes a technique for testing $d$-piecewise constantness of a real-valued function.


3.1  Piecewise  Constant 


For this section, let $\mathrm{NS}_\delta(f) = \int_0^1 \mathrm{NS}_\delta(f, x)\,dx$, where $\mathrm{NS}_\delta(f, x) = \frac{1}{2\delta} \int_{x-\delta}^{x+\delta} I[f(x) \neq f(y)]\,dy$.

Proposition 3.1. Fix $\delta > 0$ and let $f : [0, 1] \to \mathbb{R}$ be a $d$-piecewise constant function. Then

\[ \mathrm{NS}_\delta(f) \leq (d - 1)\frac{\delta}{2}. \]

Proof. For any fixed $b \in [0, 1]$, the probability that $x < b < y$ when $x \sim U(0, 1)$ and $y \sim U(x - \delta, x + \delta)$ is

\[ \Pr_{x,y}[x < b < y] = \int_0^\delta \Pr_{y \sim U(b-t-\delta,\, b-t+\delta)}[y \geq b]\,dt = \int_0^\delta \frac{\delta - t}{2\delta}\,dt = \frac{\delta}{4}. \]

Similarly, $\Pr_{x,y}[y < b < x] = \frac{\delta}{4}$. So the probability that $b$ lies between $x$ and $y$ is at most $\frac{\delta}{2}$.

When $f$ is a $d$-piecewise constant function, $f(x) \neq f(y)$ only if at least one of the boundaries $b_1, \ldots, b_{d-1}$ of the regions of $f$ lies in between $x$ and $y$. So by the union bound, $\Pr[f(x) \neq f(y)] \leq (d - 1)(\delta/2)$. Note that if $b$ is within distance $\delta$ of 0 or 1, the probability is only lower. □
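The bound $\Pr[f(x) \neq f(y)] \leq (d - 1)(\delta/2)$ can be checked numerically for a concrete piecewise constant function. The sketch below is illustrative and rests on assumptions not fixed by the text: the function name `noise_sensitivity`, the representation of $f$ by its breakpoints and values, the midpoint grid over $x$, and the convention that points $y$ falling outside $[0, 1]$ are clamped to the nearest endpoint.

```python
def noise_sensitivity(boundaries, values, delta, grid=2000):
    """Approximate NS_delta(f) for a piecewise constant f on [0, 1].

    boundaries : sorted interior breakpoints b_1 < ... < b_{d-1}
    values     : the d values taken by f on the consecutive pieces
    Points y outside [0, 1] are clamped to the nearest endpoint (an assumption),
    which is implemented by extending the first and last pieces to infinity.
    """
    edges = [float("-inf")] + list(boundaries) + [float("inf")]
    total = 0.0
    for g in range(grid):
        x = (g + 0.5) / grid                           # midpoint rule in x
        k = sum(1 for b in boundaries if b <= x)       # index of the piece containing x
        differ = 0.0                                   # length of {y : f(y) != f(x)}
        for j, v in enumerate(values):
            if v != values[k]:
                lo = max(edges[j], x - delta)
                hi = min(edges[j + 1], x + delta)
                differ += max(0.0, hi - lo)
        total += differ / (2 * delta)
    return total / grid


delta = 0.05
ns = noise_sensitivity([0.3, 0.7], [0.0, 2.5, -1.0], delta)   # d = 3 pieces
print(ns <= (3 - 1) * delta / 2 + 1e-9)
```

For breakpoints that are more than $2\delta$ apart and more than $\delta$ from the endpoints, each boundary contributes exactly $\delta/2$, so the bound is attained; a breakpoint close to 0 or 1 contributes less, matching the closing remark of the proof.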

2 

Lemma 3.2. Fix $\delta =$ Let $f : [0, 1] \to \mathbb{R}$ be any function with noise sensitivity $\mathrm{NS}_\delta(f) \leq (d - 1)\frac{\delta}{2}(1 + \frac{\epsilon}{4})$. Then $f$ is $\epsilon$-close to a $d$-piecewise constant function.

Proof. The proof proceeds in two steps: We first show that $f$ is $\frac{\epsilon}{2}$-close to a $(1 + (d - 1)(1 + \frac{\epsilon}{2}))$-piecewise constant function, and then we show that every $(1 + (d - 1)(1 + \frac{\epsilon}{2}))$-piecewise constant function is $\frac{\epsilon}{2}$-close to a $d$-piecewise constant function.

For each $y \in \mathbb{R}$, consider the function $f_\delta^y : [0, 1] \to [0, 1]$ defined by

\[ f_\delta^y(x) = \frac{1}{2\delta} \int_{x-\delta}^{x+\delta} I[f(t) = y]\,dt. \]

The function $f_\delta^y$ is the convolution of $f^y = I[f = y]$ and the uniform kernel $\phi : \mathbb{R} \to [0, 1]$ defined by $\phi(x) = \frac{1}{2\delta} I[|x| \leq \delta]$.

Note that for any $x$, there is at most one value $y \in \mathbb{R}$ for which $f_\delta^y(x) > 1/2$. Fix $\tau = \frac{4}{\epsilon} \mathrm{NS}_\delta(f)$. We introduce the function $g^* : [0, 1] \to \mathbb{R} \cup \{*\}$ by setting

\[ g^*(x) = \begin{cases} \arg\max_{y \in \mathbb{R}} f_\delta^y(x) & \text{when } \sup_{y \in \mathbb{R}} f_\delta^y(x) \geq 1 - \tau, \\ * & \text{otherwise} \end{cases} \]

for all $x \in [0, 1]$. Finally, we define $g : [0, 1] \to \mathbb{R}$ by setting $g(x) = g^*(z)$ where $z \leq x$ is the largest value for which $g^*(z) \neq *$. (If no such $z$ exists, we let $g(x) = g^*(z)$ for the smallest value $z \geq x$ with $g^*(z) \neq *$; if that does not exist, then for completeness define $g(x) = 0$ everywhere, though this case will not come up.)

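We first claim that $\mathrm{dist}(f, g) \leq \frac{\epsilon}{4}$. To see this, note that

\[
\begin{aligned}
\mathrm{dist}(f, g) &= \Pr_x[f(x) \neq g(x)] \\
&\leq \Pr_x[g^*(x) = *] + \Pr_x[* \neq g^*(x) \neq f(x)] \\
&= \Pr_x\Big[\sup_{y \in \mathbb{R}} f_\delta^y(x) < 1 - \tau\Big] + \Pr_x\Big[\sup_{y \in \mathbb{R} \setminus \{f(x)\}} f_\delta^y(x) \geq 1 - \tau\Big].
\end{aligned}
\]

Because $\tau < 1/2$, at most one $y$ can have $f_\delta^y(x) \geq 1 - \tau$, so that both $\sup_{y \in \mathbb{R}} f_\delta^y(x) < 1 - \tau$ and $\sup_{y \in \mathbb{R} \setminus \{f(x)\}} f_\delta^y(x) \geq 1 - \tau$ imply $f_\delta^{f(x)}(x) < 1 - \tau$; thus, since these events are disjoint, the above sum of probabilities is at most

\[ \Pr_x\big[f_\delta^{f(x)}(x) < 1 - \tau\big]. \]

Now observe that $\mathrm{NS}_\delta(f, x) = 1 - f_\delta^{f(x)}(x)$ and that $E_x \mathrm{NS}_\delta(f, x) = \mathrm{NS}_\delta(f)$. From these identities and Markov's inequality, we have that

\[ \Pr_x\big[f_\delta^{f(x)}(x) < 1 - \tau\big] = \Pr_x\big[1 - f_\delta^{f(x)}(x) > \tau\big] = \Pr_x[\mathrm{NS}_\delta(f, x) > \tau] \leq \frac{\mathrm{NS}_\delta(f)}{\tau} = \frac{\epsilon}{4}. \]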


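We now want to show that $g$ is $m$-piecewise constant, for some $m \leq d(1 + \frac{\epsilon}{2})$. Since each $f_\delta^y$ is the convolution of $I[f = y]$ with a uniform kernel of width $2\delta$, it is Lipschitz continuous (with Lipschitz constant $\frac{1}{2\delta}$). Also recall that $\tau < 1/2$, and at most one value $y$ can have $f_\delta^y(x) > 1/2$ for any given $x$. Thus, if we consider any two points $x, z \in [0, 1]$ with $* \neq g^*(x) \neq g^*(z) \neq *$ and $x < z$, it must be that $|x - z| \geq 2\delta \cdot 2(\frac{1}{2} - \tau)$, and that there is at least one point $t \in (x, z)$ with $\sup_{y \in \mathbb{R}} f_\delta^y(t) = 1/2$. Since each $f_\delta^y$ is $\frac{1}{2\delta}$-Lipschitz, so is $\sup_y f_\delta^y$, so that we have

\[ \int_{t - 2\delta(\frac{1}{2} - \tau)}^{t + 2\delta(\frac{1}{2} - \tau)} f_\delta^{f(s)}(s)\,ds \leq \int_{t - 2\delta(\frac{1}{2} - \tau)}^{t + 2\delta(\frac{1}{2} - \tau)} \sup_{y \in \mathbb{R}} f_\delta^y(s)\,ds \leq 2 \int_0^{2\delta(\frac{1}{2} - \tau)} \left( \frac{1}{2} + \frac{s}{2\delta} \right) ds = 2\delta\left(\tfrac{1}{2} - \tau\right)\left(\tfrac{3}{2} - \tau\right). \]

Therefore,

\[
\begin{aligned}
\int_x^z \mathrm{NS}_\delta(f, s)\,ds = \int_x^z \big(1 - f_\delta^{f(s)}(s)\big)\,ds &\geq (z - x) - 2\delta\left(\tfrac{1}{2} - \tau\right)\left(\tfrac{3}{2} - \tau\right) \\
&\geq 2\delta \cdot 2\left(\tfrac{1}{2} - \tau\right) - 2\delta\left(\tfrac{1}{2} - \tau\right)\left(\tfrac{3}{2} - \tau\right) = 2\delta\left(\tfrac{1}{2} - \tau\right)\left(\tfrac{1}{2} + \tau\right) = 2\delta\left(\tfrac{1}{4} - \tau^2\right).
\end{aligned}
\]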


Since any $x$ with $g^*(x) \neq *$ has $g(x) = g^*(x)$, and since $g$ is defined to be continuous from the right on $[0, 1]$, for every transition point $x > 0$ for $g$ (i.e., a point $x$ for which there exist arbitrarily close points $z$ having $g(z) \neq g(x)$), there is a point $z < x$ such that every $t \in (z, x)$ has $g(t) = g^*(z) \neq g^*(x) = g(x)$; combined with the above, we have that $\int_z^x \mathrm{NS}_\delta(f, s)\,ds \geq 2\delta(\frac{1}{4} - \tau^2)$. Altogether, if $g$ has $m$ such transition points, then

\[ \mathrm{NS}_\delta(f) = \int_0^1 \mathrm{NS}_\delta(f, s)\,ds \geq m\, 2\delta\left(\tfrac{1}{4} - \tau^2\right). \]

By assumption, $\mathrm{NS}_\delta(f) \leq (d - 1)\frac{\delta}{2}(1 + \frac{\epsilon}{4})$. Therefore, we must have

\[ m \leq \frac{(d - 1)\delta(1 + \frac{\epsilon}{4})}{4\delta(\frac{1}{4} - \tau^2)} \leq (d - 1)\frac{1 + \frac{\epsilon}{4}}{1 - 4\tau^2} \leq (d - 1)\frac{1 + \frac{\epsilon}{4}}{(1 - 2\tau)^2} \leq (d - 1)\left(1 + \frac{\epsilon}{2}\right). \]

In particular, this means $g$ is $(m + 1)$-piecewise constant, for an $m \leq (d - 1)(1 + \frac{\epsilon}{2})$.


Finally, we want to show that any $(m + 1)$-piecewise constant function, for $m \leq (d - 1)(1 + \frac{\epsilon}{2})$, is $\frac{\epsilon}{2}$-close to a $d$-piecewise constant function. Let $\ell_1, \ldots, \ell_{m+1}$ represent the lengths of the $m + 1$ regions in $g$. Clearly, $\ell_1 + \cdots + \ell_{m+1} = 1$, so there must be a set $S$ of $(m + 1) - d \leq (d - 1)\epsilon/2$ regions in $g$ with total length

\[ \sum_{i \in S} \ell_i \leq \frac{(m + 1) - d}{m + 1} \leq \frac{(d - 1)\epsilon/2}{1 + (d - 1)(1 + \frac{\epsilon}{2})} < \frac{\epsilon}{2}. \]

Consider the function $h : [0, 1] \to \mathbb{R}$ obtained by removing the regions in $S$ from $g$ (i.e., for each $x$ in a region indexed by $i \in S$, setting $h(x) = h(z)$ for $z$ a point in the nearest region to $x$ that is not indexed by some $j \in S$). The function $h$ is then $d$-piecewise constant, and $\mathrm{dist}(g, h) \leq \frac{\epsilon}{2}$. This completes the proof, since $\mathrm{dist}(f, h) \leq \mathrm{dist}(f, g) + \mathrm{dist}(g, h) \leq \epsilon$. □


With these results, applying the same technique as used in the unions of intervals method in the previous chapter yields a tester for $d$-piecewise constant functions.
Chapter 4


Learnability of DNF with
Representation-Specific Queries


Abstract 

1 We study the problem of PAC learning the space of DNF functions with a type of query specific to the representation of the target DNF. Specifically, given a pair of positive examples from a polynomial-sized sample, our query asks whether the two examples satisfy a term in common in the target DNF. We show that a number of interesting special types of DNF targets are efficiently properly learnable with this type of query, though the general problem of learning an arbitrary DNF target under an arbitrary distribution is no easier than in the traditional PAC model. Specifically, we find that 2-term DNF are efficiently properly learnable under arbitrary distributions, as are disjoint DNF. We further study the special case of learning under the uniform distribution, and find that several other general families of DNF functions are efficiently properly learnable with these queries, including functions with $O(\log(n))$ relevant variables, and monotone DNF functions for which each variable appears in at most $O(\log(n))$ terms.

We  also  study  a  variety  of  generalizations  of  this  type  of  query.  For  instance,  consider  in- 
1  Joint  work  with  Avrim  Blum  and  Jaime  Carbonell. 
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stead  the  ability  to  ask  how  many  terms  a  pair  of  examples  satisfy  in  common,  where  the  exam¬ 
ples  are  again  taken  from  a  polynomial-sized  sample.  In  this  case,  we  can  efficiently  properly 
learn  several  more  general  classes  of  DNF,  including  DNF  having  0(log(n))  terms,  DNF  having 
0(log(n))  relevant  variables,  DNF  for  which  each  example  can  satisfy  at  most  0(1)  terms,  all 
under  arbitrary  distributions.  Other  possible  generalizations  of  the  query  include  allowing  the 
algorithm  to  ask  the  query  for  an  arbitrary  number  of  examples  from  the  sample  at  once  (rather 
than  just  two),  or  allowing  the  algorithm  to  ask  the  query  for  examples  of  its  own  construction; 
we  show  that  both  of  these  generalizations  allow  for  efficient  proper  learnability  of  arbitrary  DNF 
functions  under  arbitrary  distributions. 

4.1  Introduction 

Consider  a  bank  aiming  to  use  machine  learning  to  identify  instances  of  financial  fraud.  To 
do  so,  the  bank  would  have  experts  label  past  transactions  as  fraudulent  or  not,  and  then  run  a 
learning  algorithm  on  the  resulting  labeled  data.  However,  this  learning  problem  might  be  quite 
difficult  because  of  the  existence  of  multiple  intrinsic  types  of  fraud,  with  each  positive  example 
perhaps  involving  multiple  types.  That  is,  the  target  might  be  a  DNF  formula,  a  class  for  which 
no  efficient  algorithms  are  known. 

Yet  in  such  a  case,  perhaps  the  experts  performing  the  labeling  could  be  called  on  to  provide  a 
bit  more  information.  In  particular,  suppose  that  given  two  positive  examples  of  fraud,  the  experts 
could  indicate  whether  or  not  the  two  examples  are  similar  in  the  sense  of  having  at  least  one 
intrinsic  type  of  fraud  (at  least  one  term)  in  common.  Or  perhaps  the  experts  could  indicate  how 
similar  the  examples  are  (how  many  terms  in  common  they  satisfy).  This  is  certainly  substantially 
more  information.  Can  it  be  used  to  learn  DNF  formulas  and  their  natural  subclasses  efficiently? 

In  our  work,  we  study  the  problem  of  learning  DNF  formulas  and  other  function  classes 
using  such  pairwise,  representation-dependent  queries.  Specifically,  we  consider  queries  of  the 
form, “Do these two positive examples satisfy at least one term in common in the target DNF
formula?” (we call these boolean similarity queries) and “How many terms in common do these
two positive examples satisfy?” (we call these numerical similarity queries).

4.1.1  Our  Results 

We  begin  with  a  somewhat  surprising  negative  result,  that  learning  general  DNF  formulas  under 
arbitrary  distributions  from  boolean  similarity  queries  is  as  hard  as  PAC-learning  DNF  formulas 
without  them.  This  result  uses  the  equivalence  between  group  learning,  weak  learning,  and 
strong  learning.  In  contrast,  learning  disjoint  DNF  (a  class  that  contains  decision  trees)  with 
such queries is quite easy. We additionally show that such queries help in a number of other
important cases, including properly learning “parsimonious” DNF formulas (formulas for which
no term can be deleted without appreciably changing the function) as well as any 2-term DNF,
a class known to be NP-hard to properly learn from labeled data alone.

Under the uniform distribution, we can properly learn any DNF formula for which each variable
appears in O(log(n)) terms, as well as any DNF formula with O(log(n)) relevant variables.

If we are allowed to ask numerical similarity queries, then we show we can properly learn
any DNF formula having O(log(n)) terms, under arbitrary distributions, or any DNF formula
having O(log(n)) relevant variables, again under arbitrary distributions. If we are allowed to ask
“Do these k examples satisfy any term in common?” for arbitrary (poly-sized) k, we can even
properly learn arbitrary DNF formulas under arbitrary distributions.

This topic of learning with representation-specific queries is interesting, even beyond the
DNF case, and we have explored a variety of other learning problems of this type as well.


4.2  Learning  DNF  with  General  Queries:  Hardness  Results 

Theorem 4.1. Learning DNF from random data under arbitrary distributions with boolean similarity
queries is as hard as learning DNF from random data under arbitrary distributions with
only the labels (no queries).


Proof. [Kearns, 1989] and [Kearns, Li, and Valiant, 1994] proved that “group learning” is
equivalent to “weak learning”.

In group learning, at each round we are given poly(n) examples that are either all iid from
$D^+$ or all iid from $D^-$ (i.e., all positive or all negative) and our goal is to figure out which
case it is. Later, of course, Schapire [Schapire, 1990] proved that weak learning is equivalent to
strong learning. So, if DNF is hard to PAC-learn, then DNF is also hard to group-learn.

Now,  consider  the  following  reduction  from  group-learning  DNF  in  the  standard  model  to 
learning  DNF  in  the  extended  queries  model.  In  particular,  given  an  algorithm  A  for  learning 
from  a  polynomial  number  of  examples  in  the  extended  queries  model,  we  show  how  to  use  A  to 
group-learn  as  follows: 

Given a set $S$ of $m = \mathrm{poly}(n)$ examples $x_1, x_2, \ldots, x_m$ (we will use $m = tn$, where $t$ is the
number of terms in the target), construct a new example by just concatenating them together. So
overall we now have $nm$ variables. We present this concatenated example to A with label equal
to the label of $S$. If A makes a similarity query between two positive examples $[x_1, x_2, \ldots, x_m]$
and $[x'_1, x'_2, \ldots, x'_m]$, we simply output yes (i.e., that they do indeed share a term in common).

We now argue that with high probability, the labels and our responses to A are all fully
consistent with some DNF formula of size $mt$. In particular, we claim they will be consistent
with a target function that is just the OR of $m$ copies of the original target function, with the
$k$-th copy applied to the $k$-th block of $n$ variables.

First of all, note that the OR of $m$ copies of the original target function will produce the
correct labels since by assumption either all $x_i \in S$ are positive or all $x_i \in S$ are negative.
Next, we claim that with high probability, any two of these concatenated positive examples will
share a term in common. Specifically, if the original DNF formula has $t$ terms, then for two random
positive examples from $D^+$ there is probability at least $1/t$ that they share a common term. So, the
chance of failure for two concatenated examples is at most $(1 - 1/t)^m$. (Because the only way that two
of these big concatenated examples $[x_1, x_2, \ldots, x_m]$ and $[x'_1, x'_2, \ldots, x'_m]$ can fail to share a term in
common is if $x_1$ and $x'_1$ fail, $x_2$ and $x'_2$ fail, etc.) Setting $m = tn$, the probability of failure for
any given query is at most $1/e^n$. Applying the union bound over all polynomially-many pairs of
positive examples in A’s sample yields that with high probability all our responses are consistent.
Therefore, by assumption, A will produce a low-error hypothesis under the distribution over
concatenated examples, which yields a low-error hypothesis for the group-learning problem. □
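
As an illustration of the reduction in the proof of Theorem 4.1, the following Python sketch builds concatenated examples, forms $m$ shifted copies of a small DNF, and evaluates the failure bound $(1 - 1/t)^{tn}$. The DNF encoding (a list of dicts mapping a variable index to its required bit), the toy target, and all function names are assumptions made for this example only.

```python
import math

def concatenate(group):
    """Concatenate a group of m n-bit examples into one (m*n)-bit example."""
    return tuple(bit for x in group for bit in x)

def shifted_copies(terms, n, m):
    """m shifted copies of a DNF; copy k reads variables k*n, ..., (k+1)*n - 1."""
    return [{k * n + i: b for i, b in term.items()}
            for k in range(m) for term in terms]

def satisfied_terms(terms, x):
    """Indices of the terms of the DNF that example x satisfies."""
    return {j for j, term in enumerate(terms)
            if all(x[i] == b for i, b in term.items())}

def share_term(terms, x, y):
    """Ground-truth boolean similarity query for the given DNF."""
    return bool(satisfied_terms(terms, x) & satisfied_terms(terms, y))

def failure_bound(t, n):
    """Bound (1 - 1/t)^(t*n) on the chance that answering 'yes' is inconsistent."""
    return (1.0 - 1.0 / t) ** (t * n)

# Hypothetical 2-term target over n = 2 variables: x0 OR x1.
target = [{0: 1}, {1: 1}]
big_target = shifted_copies(target, n=2, m=2)
big_x = concatenate([(1, 0), (0, 1)])
big_y = concatenate([(0, 1), (0, 1)])
```

In this toy instance the first pair of sub-examples shares no term but the second pair does, so `share_term(big_target, big_x, big_y)` holds, which agrees with the "yes" that the reduction returns without consulting any oracle.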

We can extend the above result to “approximate numerical” queries that give the correct
answer up to $1 \pm \tau$ for some constant $\tau > 0$ (or even $\tau > 1/\mathrm{poly}(n)$).

Theorem 4.2. Learning DNF from random data under arbitrary distributions with
approximate-numerical-valued queries is as hard as learning DNF from random data under arbitrary
distributions with only the labels (no queries).

Proof. Assume we have an algorithm A that learns to error $\epsilon/2$ given a similarity oracle that tells
us how many terms two examples have in common, up to a multiplicative factor $\tau$. Specifically, if
$C$ is the number of terms in common, the oracle returns a value in the range $[(1 - \tau)C, (1 + \tau)C]$.

Now we do the reduction from group learning as before, forming higher-dimensional examples
by concatenating groups $x_1, \ldots, x_m$, all of the same class, but this time with
$m = 2n t^4 (1 + \tau/2)^2/\tau^2$. Suppose, for now, that we know for the original DNF formula, the
expected number of terms $\alpha$ that two random positive examples would have in common
(we discharge this assumption later). In that case, when queried by A for the similarity between
two positive examples $x, x'$, we simply answer with the closest integer to $\alpha m$. As before, we
argue that with high probability, our answers are consistent with a DNF formula $g$ consisting of
just $m$ shifted copies of the original DNF.

Note that for a random pair of the concatenated examples composed of positive sub-examples,
the expected number of terms in common in $g$ is $m\alpha$. Furthermore, the number of terms in common
is a sum of $m$ independent samples of the original random variable (the one with mean $\alpha$),
each of which is bounded in the range $[0, t]$. So Hoeffding’s inequality implies that with probability
at least $1 - 2e^{-2m^2\alpha^2(\tau/2)^2/(m t^2 (1+\tau/2)^2)} \ge 1 - 2e^{-n}$ (since $\alpha \ge 1/t$), the number $C$ of terms in common
satisfies $|C - m\alpha| \le m\alpha(\tau/2)/(1 + \tau/2)$, which implies $(1 - \tau/2)C \le m\alpha \le (1 + \tau/2)C$.

Thus, for a poly(n)-sized sample of data points, with high probability, all of the pairs of
positive concatenated examples have the nearest integer to $m\alpha$ within these factors of their true
number of terms in common. It therefore suffices to respond to A’s similarity queries with the
nearest integer to $m\alpha$.

Now the only trouble is that we do not know $\alpha$. So we just try all positive integers $i$ from
1 to $mt$ and then use a validation set to select among the hypotheses produced. That is, we
run A on the constructed data set and respond to all similarity queries with a single value $i$,
getting back a classifier for these concatenated examples, and then repeat for each $i$. Then we
take $O((1/\epsilon)\log(mt/\delta))$ additional higher-dimensional samples (with labels) and choose,
among these $mt$ returned classifiers, the one having the smallest number of mistakes on them.

At least one of these $mt$ values of $i$ is the closest integer to $m\alpha$, so at least one of these $mt$
classifiers is $\epsilon/2$-good, and our validation set will identify one whose error is at most $\epsilon$. So we
can use this classifier to identify whether a random $m$-sized group of examples is composed of
all positives or all negatives, with error rate $\epsilon$: i.e., we can do group learning.

If the algorithm A only has a “high probability” guarantee on success, we can repeat this several
times with independent data sets, to boost the confidence that there will be a good classifier
among those we choose from at the end, and slightly increase the size of the validation set to
compensate for this larger number of classifiers. □
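
The concentration step in the proof of Theorem 4.2 can be checked by substituting the chosen $m$ into the Hoeffding exponent. The derivation below is a worked verification under the definitions used in the proof ($C$ is a sum of $m$ independent variables in $[0, t]$ with mean $m\alpha$, and $\alpha \ge 1/t$); the shorthand $\lambda$ for the deviation is introduced only for this calculation and does not appear in the original argument.

```latex
\begin{align*}
\lambda &= \frac{m\alpha(\tau/2)}{1+\tau/2}, \qquad
m = \frac{2 n t^4 (1+\tau/2)^2}{\tau^2}, \\
P\big(|C - m\alpha| \ge \lambda\big)
 &\le 2\exp\!\left(-\frac{2\lambda^2}{m t^2}\right)
 = 2\exp\!\left(-\frac{2 m \alpha^2 (\tau/2)^2}{t^2 (1+\tau/2)^2}\right) \\
 &= 2\exp\!\left(-n t^2 \alpha^2\right) \le 2e^{-n}.
\end{align*}
```

The last inequality uses $t^2\alpha^2 \ge 1$, which is exactly the condition $\alpha \ge 1/t$ invoked in the proof.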

4.3  Learning  DNF  with  General  Queries:  Positive

4.3.1  Methods 

The  Neighborhood  Method 

We refer to the following simple procedure as the “neighborhood method”. Take $m = \mathrm{poly}(n, 1/\epsilon, \log(1/\delta))$
samples. First, among the positive examples, query all pairs (with the binary-valued query) to
construct a graph, in which examples are adjacent if they satisfy a term in common. For each
positive example, construct a minimal conjunction consistent with that example and all of its
neighbors (i.e., the consistent conjunction having the largest number of literals in it). Next, discard
any of these conjunctions that make mistakes on any negative examples. Then sequentially remove
any conjunction $c_1$ such that some other remaining conjunction $c_2$ subsumes it (contains a
subset of the variables). Form a DNF from the remaining conjunctions. Produce this resultant
DNF as the output hypothesis.
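
The steps of the neighborhood method can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: conjunctions are encoded as frozensets of (index, bit) literals, the similarity oracle is simulated from a hypothetical target DNF, and all names are assumptions introduced for the example.

```python
from itertools import combinations

def satisfies(conj, x):
    """True if example x agrees with every literal (index, bit) of conj."""
    return all(x[i] == b for i, b in conj)

def minimal_conjunction(examples):
    """Most specific conjunction consistent with all the given examples."""
    first = examples[0]
    return frozenset((i, b) for i, b in enumerate(first)
                     if all(x[i] == b for x in examples))

def neighborhood_method(positives, negatives, query):
    """query(x, y) is the boolean similarity oracle on positive examples."""
    # Build the similarity graph; every example is its own neighbor.
    nbrs = {a: [a] for a in range(len(positives))}
    for a, b in combinations(range(len(positives)), 2):
        if query(positives[a], positives[b]):
            nbrs[a].append(b)
            nbrs[b].append(a)
    # One minimal conjunction per positive example and its neighborhood.
    conjs = {minimal_conjunction([positives[j] for j in group])
             for group in nbrs.values()}
    # Discard conjunctions that accept some negative example.
    conjs = {c for c in conjs if not any(satisfies(c, x) for x in negatives)}
    # Drop conjunctions whose literals strictly contain another survivor's.
    return [c for c in conjs if not any(d < c for d in conjs)]

# Hypothetical target (x0 AND x1) OR (x2 AND x3), used only to simulate the oracle.
target = [frozenset({(0, 1), (1, 1)}), frozenset({(2, 1), (3, 1)})]

def oracle(x, y):
    return any(satisfies(t, x) and satisfies(t, y) for t in target)

positives = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1), (1, 0, 1, 1), (1, 1, 1, 1)]
negatives = [(1, 0, 0, 1), (0, 1, 1, 0), (0, 0, 0, 0)]
hypothesis = neighborhood_method(positives, negatives, oracle)
```

In this toy sample the example `(1, 1, 1, 1)` satisfies both terms, so its neighborhood is the whole positive set and its minimal conjunction is empty; that conjunction accepts the negatives and is discarded, leaving the two target terms.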

Lemma 4.3. Suppose the target DNF has $t = \mathrm{poly}(n)$ terms. For an appropriate ($t$-dependent)
polynomial sample size $m$, the neighborhood method will, with probability at least $1 - \delta$,
produce an $\epsilon$-accurate DNF if, for each term $T_i$ in the target DNF having a probability of
satisfaction at least $\epsilon/(2t)$, there is at least a $p = 1/\mathrm{poly}(n, 1/\epsilon)$ probability that a random example
satisfies term $T_i$ and no other term (we call such an example a “nice seed” for $T_i$).

Proof. Under these conditions, $m = O((1/p)\log(t/\delta) + (t/\epsilon)\log(1/(\epsilon\delta)))$ samples suffice to
guarantee each $T_i$ with probability of satisfaction at least $\epsilon/(2t)$ has at least one nice seed, with
probability at least $1 - \delta/2$.

In the second phase, we remove any conjunction inconsistent with the negative examples. The
conjunctions guaranteed by the above argument survive this pruning due to their minimality, and
the fact that they are learned from a set of examples that actually are consistent with some term
in the target DNF (due to the nice seed). The final pruning step, which removes any redundancies
in the set of conjunctions, leaves at most $t$ conjunctions.

The terms that do not have nice seeds compose at most $\epsilon/2$ total probability mass, and $m$ is
large enough so that with probability at least $1 - \delta/4$, at most an $\epsilon/4$-fraction of the data satisfy
these terms. Thus, since the result of the neighborhood method is a DNF formula with at most
$t$ terms, which correctly labels a $1 - \epsilon/2$ fraction of the $m$ examples, the standard PAC bounds
imply that with probability at least $1 - \delta/4$, the resulting DNF has error rate at most $\epsilon$. A union
bound over the above events implies this holds with probability at least $1 - \delta$. □
The  Common  Profile  Approach


In  the  case  of  numerical  queries,  we  have  some  additional  flexibility  in  designing  a  method.  In 
this  context,  we  refer  to  the  following  procedure  as  the  “common  profiles  approach”. 

Consider a sample of $m = \mathrm{poly}(n, 1/\epsilon, \log(1/\delta))$ random labeled examples, and for each
pair of positive examples $x, y$, we request the number $K(x, y)$ of terms they satisfy in common;
we additionally request $K(x, x)$ for each positive example $x$. For each positive example $x$, we
identify the set $S$ of examples $y$ such that the numerical value of $K(x, y)$ is equal to $K(x, x)$. So
these points satisfy at least all the terms $x$ satisfies. For each such set $S$, we learn a minimal
conjunction consistent with these examples. Then for each of these conjunctions, if it is a
specialization of some other one of the conjunctions, we discard it. Then we form our hypothesis
DNF with the remaining conjunctions as the terms.
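
The common profiles approach admits a similarly short sketch. As before, this is an illustration under stated assumptions: the numerical oracle `K` is simulated from a hypothetical target DNF, conjunctions are frozensets of (index, bit) literals, and the function names are not taken from the text.

```python
def satisfies(conj, x):
    """True if example x agrees with every literal (index, bit) of conj."""
    return all(x[i] == b for i, b in conj)

def minimal_conjunction(examples):
    """Most specific conjunction consistent with all the given examples."""
    first = examples[0]
    return frozenset((i, b) for i, b in enumerate(first)
                     if all(x[i] == b for x in examples))

def common_profiles(positives, K):
    """K(x, y) is the numerical similarity oracle on positive examples."""
    conjs = set()
    for x in positives:
        k_xx = K(x, x)  # number of target terms that x satisfies
        # Examples that satisfy at least every term x satisfies.
        S = [y for y in positives if K(x, y) == k_xx]
        conjs.add(minimal_conjunction(S))
    # Discard any conjunction that is a specialization of another one.
    return [c for c in conjs if not any(d < c for d in conjs)]

# Hypothetical target (x0 AND x1) OR (x2 AND x3), used only to simulate K.
target = [frozenset({(0, 1), (1, 1)}), frozenset({(2, 1), (3, 1)})]

def K(x, y):
    return sum(1 for t in target if satisfies(t, x) and satisfies(t, y))

positives = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1), (1, 0, 1, 1), (1, 1, 1, 1)]
hypothesis = common_profiles(positives, K)
```

Here the example `(1, 1, 1, 1)` has `K(x, x) = 2`, so its set `S` contains only itself and yields a four-literal conjunction, which is a specialization of both other conjunctions and is therefore discarded.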

For any example $x$, relative to a particular target DNF, we refer to the “profile” of $x$ as the set
of terms $T_i$ in the target DNF satisfied by $x$.

Lemma 4.4. If the target DNF has at most $p = \mathrm{poly}(n)$ possible profiles, then the common
profile approach, with an appropriate ($p$-dependent) sample size $m$, will with probability at least
$1 - \delta$, produce a DNF having error rate at most $\epsilon$.


Proof. Note that this procedure produces a DNF that correctly labels the entire data set, since
$K(x, y) = K(x, x)$ implies $y$ satisfies every term that $x$ satisfies, so that in particular the set $S$ has some
term in common to all the examples. If there are only a poly(n) number of possible profiles,
then the above will only produce at most as many distinct terms in its hypothesis DNF, so that a
sufficiently large poly(n)-sized data set will be sufficient to guarantee good generalization error.
Specifically, $m = O((pn/\epsilon)\log(1/(\epsilon\delta)))$ examples are enough to guarantee with probability at
least $1 - \delta$, any DNF consistent with the data having at most $p$ terms will have error rate at most
$\epsilon$, so this is sufficient for the common profile approach. □
4.3.2  Positive  Results


Theorem 4.5. With numerical-valued queries, we can properly learn any DNF having O(log(n))
relevant variables, under arbitrary distributions.

Proof.  These  targets  have  poly(n)  possible  profiles,  so  the  common  profiles  approach  will  be 
successful.  □ 

Theorem 4.6. If the target DNF has only O(log(n)) terms, then we can efficiently properly learn
from random data under any distribution using numerical-valued queries.

Proof. There are only poly(n) possible profiles, so the “common profiles” approach
will work. □

The  above  result  is  interesting  particularly  because  proper  learning  (even  for  2-term  DNF)  is 
known  to  be  hard  from  labeled  data  alone. 

Theorem 4.7. If the target DNF has $t = \mathrm{poly}(n)$ terms, and is such that any example can satisfy
at most O(1) terms, then we can efficiently properly learn from random data using
numerical-valued queries.

Proof. There are at most poly(t) = poly(n) possible profiles, so the “common profiles” approach
will work. □

Corollary  4.8.  We  can  properly  learn  any  k-term  DNF  with  numerical-valued  queries,  where  k 
is  constant. 

Proof.  This  follows  from  either  Theorem  4.6  or  Theorem  4.7.  □ 

Corollary 4.9. If the DNF is such that any example can satisfy at most 1 term (a so-called
“disjoint” DNF), then we can efficiently properly learn from random data using binary-valued
queries.

Proof.  A  numerical  query  whose  value  can  be  at  most  1  is  just  a  binary  query  anyway.  □ 
In particular, decision trees can be thought of as DNF formulas in which each example satisfies at
most 1 term.

Lemma 4.10. If it happens that the target DNF is parsimonious (no redundant terms) for some
random $\Omega((tn/\epsilon)\log(1/\epsilon) + (1/\epsilon)\log(1/\delta))$-sized data set (for any distribution), then we can
efficiently produce a DNF consistent with it having at most $t$ terms using binary-valued queries.

Proof (Sketch). Parsimonious, in this case, means that we cannot remove any terms without
changing some labels. But this means that every term has some example that satisfies only that
term (i.e., a nice seed). So as described in the proof of Lemma 4.3 above, the “neighborhood
method” produces a DNF with terms for the neighborhoods of each of these nice seeds, which,
in the parsimonious case, covers all of the positive examples. □

Theorem  4.11.  We  can  properly  learn  2-term  DNF  with  binary  queries. 

Proof. Take $O((n/\epsilon)\log(1/\epsilon) + (1/\epsilon)\log(1/\delta))$ random labeled examples and make the binary
query for all pairs of positive examples. First, find a minimal conjunction consistent with all
of the positive examples; if this conjunction does not misclassify any negative examples, return
it. By classic PAC bounds, a conjunction consistent with this many random labeled examples
will, with probability at least $1 - \delta$, have error rate at most $\epsilon$. Otherwise, if this conjunction
misclassifies some negatives, then we are assured the target DNF is parsimonious for this data
set, and thus Lemma 4.10 guarantees we can efficiently identify a 2-term DNF consistent with it
using the binary-valued queries. Again, the classic PAC bounds imply the sample size is large
enough to, with probability at least $1 - \delta$, guarantee that any consistent 2-term DNF has error
rate at most $\epsilon$. □

Theorem 4.11 gives a concrete result where using this type of query overturns a known hardness
result for supervised learning.

Open problem. Can this idea be extended to learning 3-term DNF or higher, still using only
the binary-valued queries? Or is there a hardness result for properly learning 3-term DNF with
these binary-valued pairwise queries?


4.4  Learning  DNF  under  the  Uniform  Distribution 

In this section, we investigate the problem of learning DNF under a uniform distribution on
$\{0, 1\}^n$, using the binary-valued queries.

Definition 4.12. Fix a constant $c \in (0, \infty)$. We say a term $t$ in the target DNF is “relatively
distinct” if it contains a variable $v$ which occurs in at most $c\log(n)$ other terms. We say $v$ is a
witness to $t$ being relatively distinct.

Definition 4.13. For a term $t$ in the target DNF, and a variable $v$ in $t$, we say $v$ is “sometimes
nonredundant” for $t$ if given a random example that satisfies $t$, there is at least an $\epsilon$ probability
that every term in the target DNF that the example satisfies also contains $v$.

Theorem 4.14. Suppose no term in the target DNF is logically entailed from any other term
in the target DNF, every term $t$ is relatively distinct, and that some variable $v$ that is a witness
to $t$ being relatively distinct is sometimes nonredundant for $t$. Then we can properly learn any
monotone DNF of this type under a uniform distribution on $\{0, 1\}^n$ with binary pairwise queries.

Proof. By Lemma 4.3, it suffices to show that every term having at least $\epsilon/(2T)$ probability of
being satisfied (where $T$ denotes the number of terms in the target DNF) will, with high probability,
have some example satisfying only that term, given a polynomial-sized data set.

Consider a given term $t$ in the target DNF, and choose the $v$ that witnesses relative distinctness
which is sometimes nonredundant. Note that every other term in the target DNF contains some
variable not present in $t$, and in particular this is true for the (at most) $c\log(n)$ terms containing
$v$. So under the conditional distribution given that $t$ is satisfied and that $v$ is nonredundant, with
probability at least $2^{-c\log(n)} = n^{-c}$, none of these other terms containing $v$ are satisfied, so that $t$
is the only term satisfied. Thus, since $t$ has probability at least $\epsilon/(2T)$ of being satisfied, and $v$ has
probability at least $\epsilon$ of being nonredundant given that $t$ is satisfied, we have that with probability
at least $(\epsilon^2/(2T))n^{-c}$, a random example satisfies $t$ and no other terms in the target DNF.

Since this is the case for all terms in the target, a sample of size $O((T/\epsilon^2)n^c \log(T/\delta))$ guarantees
every term has some example satisfying only that term, with probability at least $1 - \delta$. □

We can also consider the class of DNF functions having only a small number of relevant
variables. In this context, it is interesting to observe that if the $i$th variable is irrelevant, then
$P(K(x, y) = 1 \text{ and } x_i \neq y_i) = P(K(x, y) = 1 \text{ and } x_i = y_i)$, where $x$ and $y$ are independent
uniformly-distributed samples, and $K(x, y) = 1$ iff $x$ and $y$ are positive examples that satisfy at
least one term in common. However, as the following lemma shows, this is not true for relevant
variables.

Lemma 4.15. For $x$ and $y$ independent uniformly-distributed samples, if the target function has
$r$ relevant variables, and the $i$th variable is relevant in the target function, then
$P(K(x, y) = 1 \text{ and } x_i = y_i) - P(K(x, y) = 1 \text{ and } x_i \neq y_i) \ge (1/4)^r$.

Proof. For each pair $(x, y)$ with $x_i \neq y_i$, there is a unique corresponding pair $(x', y)$ with $x'_j = x_j$
for $j \neq i$, and $x'_i = y_i$. Let $M_i$ be the number of $(x, y)$ pairs with $x_i \neq y_i$ and $K(x, y) = 1$. Then
note that for every $(x, y)$ pair with $x_i \neq y_i$ and $K(x, y) = 1$, we also have $K(x', y) = 1$, since
whatever term $x$ and $y$ satisfy in common cannot contain variable $i$ anyway, so flipping that
feature in $x$ does not change whether $x$ and $y$ share a term or not. In particular, this implies
the number of $(x, y)$ pairs with $x_i = y_i$ and $K(x, y) = 1$ is at least $M_i$. However, we can also
argue it is strictly larger, as follows. By definition of “relevant”, each of the $2^r$ settings of the
relevant variables corresponds to an equivalence class of feature vectors, all of which have the
same label, and if that label is positive, then all of which have the same profile. Since variable $i$
is relevant, at least one of the $2^r$ settings of the relevant variables yields an equivalence class of
positive examples whose profile contains only terms with variable $i$ in them (these are positive
examples such that flipping variable $i$ makes them negative). The probability that both $x$ and $y$
(chosen at random) are in this equivalence class is $(1/4)^r$. Note that for the $(x, y)$ pairs of this
type, we have $x_i = y_i$ (both agree on all relevant variables) and $K(x, y) = 1$; however, if we flip
feature $x_i$, then $x$ would become negative, and hence $K(x, y)$ would no longer be 1; this means
this $(x, y)$ pair is not included among those $M_i$
pairs constructed above by flipping $x_i$ starting from some $(x, y)$ with $x_i \neq y_i$ and $K(x, y) = 1$.
So $P(K(x, y) = 1 \text{ and } x_i = y_i) - P(K(x, y) = 1 \text{ and } x_i \neq y_i) \ge (M_i/4^n + (1/4)^r) - M_i/4^n = (1/4)^r$. □
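
Lemma 4.15 can be checked exhaustively on a small instance. The following Python sketch enumerates all pairs $(x, y)$ for a hypothetical target and computes the gap exactly with rational arithmetic; the target, the dict-based term encoding, and the function names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

def satisfies(term, x):
    """True if example x agrees with every (index -> bit) literal of term."""
    return all(x[i] == b for i, b in term.items())

def K(target, x, y):
    """1 if x and y satisfy at least one term of the target in common, else 0."""
    return int(any(satisfies(t, x) and satisfies(t, y) for t in target))

def agreement_gap(target, n, i):
    """P(K = 1 and x_i = y_i) - P(K = 1 and x_i != y_i) under the uniform distribution."""
    same = diff = 0
    points = list(product((0, 1), repeat=n))
    for x in points:
        for y in points:
            if K(target, x, y):
                if x[i] == y[i]:
                    same += 1
                else:
                    diff += 1
    return Fraction(same - diff, 4 ** n)

# Hypothetical target x0 OR (x1 AND x2) on n = 4 variables; x3 is irrelevant, r = 3.
target = [{0: 1}, {1: 1, 2: 1}]
gaps = [agreement_gap(target, 4, i) for i in range(4)]
```

For this target the gaps are 15/64, 3/64, 3/64 and 0 for variables 0 through 3, so each relevant variable clears the $(1/4)^3 = 1/64$ bound while the irrelevant one has gap exactly zero.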

Theorem 4.16. Under the uniform distribution, with binary pairwise queries, we can properly
learn any DNF having O(log(n)) relevant variables.

Proof. We can use the property in Lemma 4.15 to design an algorithm as follows. For each $i$,
sample $\Omega(8^r \log(n/\delta))$ random pairs $(x, y)$, and evaluate $K(x, y)$ for each pair. Then calculate
the difference of empirical probabilities (the fraction of pairs $(x, y)$ for which $K(x, y) = 1$ and
$x_i = y_i$, minus the fraction of pairs $(x, y)$ for which $K(x, y) = 1$ and $x_i \neq y_i$). If this difference
is $> (1/2)(1/4)^r$, decide variable $i$ is relevant, and otherwise decide variable $i$ is irrelevant.
By Hoeffding and union bounds, with probability $1 - \delta/2$, this will find exactly the $r$ relevant
variables. Now enumerate all $2^r = \mathrm{poly}(n)$ possible conjunctions that can be formed from
using all of these $r$ relevant variables. Considering this as a $2^r$-dimensional feature space, take
$\Omega((2^r/\epsilon)\log(1/\delta))$ random labeled data points and learn a disjunction over this $2^r$-dimensional
feature space; since the VC dimension of this set of disjunctions is $2^r$, the usual PAC analysis
implies this will learn an $\epsilon$-good disjunction with probability $1 - \delta/2$. A union bound implies
both stages (finding variables and learning the disjunction) will succeed with probability at least
$1 - \delta$. □
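
The first stage of this proof (finding the relevant variables) can be sketched as a sampling routine. The code below is an illustration only: the per-variable sample size is passed in as a parameter and the value used in the example is an arbitrary generous choice rather than the bound from the proof, the oracle is simulated from a hypothetical target, and the names are assumptions.

```python
import random

def satisfies(term, x):
    """True if example x agrees with every (index -> bit) literal of term."""
    return all(x[i] == b for i, b in term.items())

def K(target, x, y):
    """Simulated boolean similarity oracle for a known target DNF."""
    return any(satisfies(t, x) and satisfies(t, y) for t in target)

def find_relevant(n, r, query, samples_per_variable, rng):
    """Threshold the empirical agreement gap of Lemma 4.15 at (1/2)(1/4)^r."""
    threshold = 0.5 * 0.25 ** r
    relevant = []
    for i in range(n):
        total = 0
        for _ in range(samples_per_variable):
            x = tuple(rng.randint(0, 1) for _ in range(n))
            y = tuple(rng.randint(0, 1) for _ in range(n))
            if query(x, y):
                # +1 when the pair agrees on variable i, -1 when it disagrees.
                total += 1 if x[i] == y[i] else -1
        if total / samples_per_variable > threshold:
            relevant.append(i)
    return relevant

# Hypothetical target x0 AND x1 on n = 5 variables, so r = 2.
target = [{0: 1, 1: 1}]
rng = random.Random(0)
found = find_relevant(5, 2, lambda x, y: K(target, x, y), 5000, rng)
```

For this target the true gap is $1/16$ for variables 0 and 1 and zero for the others, so the threshold $1/32$ separates the two groups by a wide margin at this sample size.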

An alternative approach to the second stage in the proof would be to take $\Omega(2^r \log(2^r/\delta))$
random samples, so that with probability at least $1 - \delta/2$, we have at least one data point satisfying
each of the $2^r$ possible conjunctions on the relevant variables; then for each of the conjunctions,
we check the label of the example that satisfies it, and if that label is positive, we include that
conjunction as a term in our DNF, and otherwise we do not include it. This has the property that,
altogether, with probability $1 - \delta$, we construct a DNF that has error rate zero.
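
This alternative second stage amounts to a table lookup over the settings of the relevant variables, as the following Python sketch shows. It assumes the relevant variables have already been identified, uses an exhaustive list of points as a stand-in for a sufficiently large random sample, and works with a hypothetical target; the function names are illustrative.

```python
from itertools import product

def dnf_from_samples(relevant, labeled_samples):
    """One term per setting of the relevant variables observed with a positive label."""
    label_of = {}
    for x, label in labeled_samples:
        setting = tuple(x[i] for i in relevant)
        # All examples sharing a setting of the relevant variables share a label.
        label_of.setdefault(setting, label)
    return [dict(zip(relevant, setting))
            for setting, label in sorted(label_of.items()) if label]

def evaluate(dnf, x):
    """Evaluate a DNF given as a list of (index -> bit) terms."""
    return any(all(x[i] == b for i, b in term.items()) for term in dnf)

def target(x):
    """Hypothetical target x0 OR (x1 AND x2); variable x3 is irrelevant."""
    return bool(x[0] or (x[1] and x[2]))

points = list(product((0, 1), repeat=4))
samples = [(x, target(x)) for x in points]  # stands in for a large random sample
dnf = dnf_from_samples([0, 1, 2], samples)
```

Each resulting term fixes all $r$ relevant variables, so the hypothesis has at most $2^r$ terms and agrees with the target on every point once all settings have been observed.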
Another family of DNF studied in the literature are those with a sublinear number of terms.
Specifically, [Servedio, 2004] proved that the class of $2^{O(\sqrt{\log n})}$-term monotone DNF are learnable
under the uniform distribution from labeled data alone. As the following theorem states,
we can extend this result to include general $2^{O(\sqrt{\log n})}$-term DNF (including non-monotone) given
access to our binary pairwise queries.

Theorem 4.17. Under the uniform distribution, with binary pairwise queries, we can learn any
$2^{O(\sqrt{\log n})}$-term DNF (supposing $\epsilon$ to be a constant).

First, we review some known results from [Servedio, 2004]. For any function $g : \{0,1\}^n \to \{-1,+1\}$,
define the $g_{i,1}$ and $g_{i,0}$ functions by the property that any $x$ with $x_i = 1$ has $g_{i,1}(x) = g(x)$,
and $g_{i,0}(x) = g(y)$, where $y_j = x_j$ for $j \ne i$ and $y_i = 0$. Then define the influence
function $I_i(g) = P(g_{i,0}(x) \ne g_{i,1}(x))$. [Servedio, 2004] developed a procedure, FindVariable,
which uses a $\mathrm{poly}(n, 1/\gamma, \log(1/\eta))$ number of random labeled samples, labeled according to
any monotone DNF $g$ having at most $t$ terms, and with probability $1 - \eta$, returns a set $S$ of
variables (indices in $\{1, \ldots, n\}$) such that every $i \notin S$ has $I_i(g) < \gamma$ and every $i \in S$ has
$I_i(g) \ge \gamma/2$ and the $i$th variable is contained in some term in $g$ with at most $\log\frac{32tn}{\gamma}$ variables in
it.
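
To make the influence function concrete, the following is a minimal sketch that computes $I_i(g)$ exactly by enumerating $\{0,1\}^n$ under the uniform distribution. It is only an illustration of the definition, not the FindVariable procedure of [Servedio, 2004]; the helper names `dnf` and `influence` and the dictionary encoding of terms are assumptions made for this example.

```python
from itertools import product

def dnf(terms):
    """Build g: {0,1}^n -> {-1,+1} from a list of terms.
    Each term is a dict {variable index: required bit} (an assumed encoding)."""
    def g(x):
        hit = any(all(x[i] == b for i, b in t.items()) for t in terms)
        return 1 if hit else -1
    return g

def influence(g, n, i):
    """Exact I_i(g) = P(g_{i,0}(x) != g_{i,1}(x)) for uniform x in {0,1}^n."""
    differ = 0
    for x in product((0, 1), repeat=n):
        x0 = x[:i] + (0,) + x[i + 1:]   # x with bit i forced to 0
        x1 = x[:i] + (1,) + x[i + 1:]   # x with bit i forced to 1
        differ += g(x0) != g(x1)
    return differ / 2 ** n

# (x0 AND x1) OR x2 on n = 3 variables
g = dnf([{0: 1, 1: 1}, {2: 1}])
influences = [influence(g, 3, i) for i in range(3)]
```

Exhaustive enumeration is only feasible for small $n$; the procedure cited in the text instead works from random labeled samples.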

Furthermore, [Servedio, 2004] showed that, for any $t$-term DNF $f$, if we are provided with
a set $S_f \subseteq \{1, \ldots, n\}$ such that every $i \notin S_f$ has $I_i(f) \le \epsilon/4n$, then we can learn $f$ in time
polynomial in $n$, $|S_f|^{O(\log(t/\epsilon)\log(1/\epsilon))}$, and $\log(1/\delta)$. In particular, for $|S_f| = O(t \log\frac{tn}{\epsilon})$ and
$t = 2^{O(\sqrt{\log n})}$, this is polynomial in $n$ (though not necessarily in $\epsilon$). Given the set $S_f$, the learning
procedure simply estimates the Fourier coefficients for small subsets of $S_f$.

Proof of Theorem 4.17. To prove Theorem 4.17, we consider the following procedure. First
sample $m$ labeled examples $x^{(1)}, \ldots, x^{(m)}$ at random. Then, for each $j \le m$, define
$K_j(\cdot) = K(x^{(j)}, \cdot)$. Now note that, if we define $\phi_j(y) = (\phi_{j1}(y), \ldots, \phi_{jn}(y))$ by
$\phi_{ji}(y) = 2\,\mathbb{1}[y_i = x^{(j)}_i] - 1$, then we can represent $K_j(\cdot) = (K'_j(\phi_j(\cdot)) + 1)/2$, where $K'_j$ is a
monotone DNF (mapping into $\{-1,+1\}$); specifically, the terms in $K'_j$ correspond to the terms in the
target satisfied by $x^{(j)}$, except none of the literals are negated. We then run FindVariable for each of these $K'_j$,
with $\gamma = \epsilon/m$ and $\eta = \delta/2m$. Let $S_f$ denote the union (over $j \le m$) of the returned sets of variables.
It remains only to show this $S_f$ satisfies the requirements for the procedure of [Servedio,
2004], including the size requirement.

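Taking $m = O\left(\frac{ct}{\epsilon}\log\frac{t}{\delta}\right)$, with probability at least $1 - \delta/4$, every term in the target having
probability at least $\epsilon/2ct$ will have at least one of the $m$ examples satisfying it. Suppose this
event happens. In particular, this means $\mathrm{error}(\max_j K_j) \le \epsilon/2c$. Note that

$$I_i(f) = P(f_{i,0}(x) \ne f_{i,1}(x)) \le 2P\left(\max_j K_j(x) \ne f(x)\right) + P\left((\max_j K_j)_{i,0}(x) \ne (\max_j K_j)_{i,1}(x)\right)$$

$$\le \epsilon/c + \sum_j P\left((K'_j)_{i,0}(x) \ne (K'_j)_{i,1}(x)\right) = \epsilon/c + \sum_j I_i(K'_j).$$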

Thus, by a union bound, with probability $1 - \delta/2$, any variable $i \notin S_f$ has $I_i(f) \le \epsilon/c + m\gamma$,
and any variable $i \in S_f$ appears in a term in some $K'_j$ of size at most $\log\frac{32tn}{\gamma}$ and therefore
also appears in a corresponding term of this size in $f$. Suppose this happens. Letting $c = 8n$ and
$\gamma = \epsilon/8nm$, we have that any $i \notin S_f$ has $I_i(f) \le \epsilon/4n$, while any $i \in S_f$ appears in a term of
size at most $\log\frac{256tn^2m}{\epsilon} = O\left(\log\frac{tn\log(1/\delta)}{\epsilon}\right)$. In particular, this implies $|S_f| = O\left(t\log\frac{tn\log(1/\delta)}{\epsilon}\right)$,
and $S_f$ satisfies the requirements of the method of [Servedio, 2004].

Thus, running the procedure from [Servedio, 2004] with confidence parameter $\delta/4$, a union
bound implies the total probability of successfully producing an $\epsilon$-good classifier is at least $1 - \delta$.
The above process of constructing $S_f$ is clearly polynomial-time. Then, if $t = 2^{O(\sqrt{\log n})}$, the
procedure of [Servedio, 2004] runs in time polynomial in $n$, $\log(1/\delta)$, and $|S_f|^{O(\log(t/\epsilon)\log(1/\epsilon))}$,
which is polynomial in $n$ and $\log(1/\delta)$ (though not necessarily in $\epsilon$). □


4.5  More  Powerful  Queries 

Theorem 4.18. If we can construct our own feature vectors in addition to getting random data,
then under any distribution we can efficiently properly learn DNF using binary-valued queries.

Proof. Suppose we can adaptively construct our own examples. Suppose the target DNF has
$T = \mathrm{poly}(n)$ terms. Oracle($x, x'$) gives the number of terms that $x$ and $x'$ have in common. For
any $x$, let $x_{-i}$ be $x$ but with the $i$th bit flipped. Let $\bar{x}$ be the negation of $x$ (every bit flipped).

Below is an algorithm. Move($x, x'$) moves $x'$ away from $x$ by one bit, while trying to maintain
at least one common term. LearnTerm($x$) returns a term in the target function.


0. Move($x, x'$)
1. $x'' \leftarrow \bar{x}$
2. For $i = 1, 2, \ldots, n$ s.t. $x_i = x'_i$
3.     If (Oracle($x, x''$) $\le$ Oracle($x, x'_{-i}$))
4.         $x'' \leftarrow x'_{-i}$
5. Return $x''$


0. LearnTerm($x$)
1. Replicate $x$ to get $x'$
2. While (Oracle($x$, Move($x, x'$)) $\ne 0$)
3.     $x' \leftarrow$ Move($x, x'$)
4. Let $I \leftarrow \{i : \mathrm{Oracle}(x, x'_{-i}) = 0\}$
5. Return $x_I$ (i.e., a conjunction with the literals indexed by $I$, either positive or negative so
that $x$ satisfies it)


0. LearnDNF
1. Initialize all-negative DNF $h$
2. Take $M = \mathrm{poly}(n) \gg nT$ random examples $S$
3. For each $x \in S$
4.     If Oracle($x, x$) $> 0$ (positive example) and $h(x)$ = negative
5.         Add term LearnTerm($x$) to $h$
6. Return $h$ (a DNF with at most $T$ terms, consistent with all $M$ examples)


When we reach an $x'$ such that we cannot flip any more bits (among those not already flipped) without
leaving $x$ and $x'$ with no terms in common, the bits these two have in common
must form a term in the target DNF, so LearnTerm($x$) finds a term in the target DNF.

□ 
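
The three routines above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the target DNF is simulated locally so that `make_oracle` can stand in for the counting oracle, terms are encoded as dictionaries from bit index to required value, every target term is assumed non-empty, and `move` starts from the negation of $x$ so that it returns a point with no common term when no single flip preserves one. All function names are hypothetical.

```python
from itertools import product

def satisfies(term, x):
    """A term is a dict {bit index: required value}; x is a tuple of bits."""
    return all(x[i] == b for i, b in term.items())

def make_oracle(terms):
    """Stand-in for Oracle(x, x'): number of target terms satisfied by both."""
    def oracle(x, y):
        return sum(1 for t in terms if satisfies(t, x) and satisfies(t, y))
    return oracle

def flip(x, i):
    return x[:i] + (1 - x[i],) + x[i + 1:]

def move(oracle, x, xp):
    best = tuple(1 - b for b in x)            # negation of x: shares no term with x
    for i in range(len(x)):
        if x[i] == xp[i]:                     # only bits not already flipped
            cand = flip(xp, i)
            if oracle(x, best) <= oracle(x, cand):
                best = cand
    return best

def learn_term(oracle, x):
    xp = x
    while True:
        nxt = move(oracle, x, xp)
        if oracle(x, nxt) == 0:               # no further flip keeps a common term
            break
        xp = nxt
    keep = [i for i in range(len(x)) if oracle(x, flip(xp, i)) == 0]
    return {i: x[i] for i in keep}            # literals signed so that x satisfies them

def learn_dnf(oracle, sample):
    h = []                                    # all-negative DNF
    for x in sample:
        if oracle(x, x) > 0 and not any(satisfies(t, x) for t in h):
            h.append(learn_term(oracle, x))
    return h

target = [{0: 1, 1: 0}, {2: 1, 3: 1}]         # (x0 AND NOT x1) OR (x2 AND x3)
oracle = make_oracle(target)
sample = list(product((0, 1), repeat=4))
h = learn_dnf(oracle, sample)
```

In this toy run the sample is the whole cube, so the hypothesis recovers the target exactly; with a random sample the guarantee is only consistency with the sample, as in step 6 of LearnDNF.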


If  we  can  ask  about  k-tuples  of  examples  (do  they  all  jointly  satisfy  a  term  in  common?),  we 
have  the  following  result: 

Theorem 4.19. If we can use query sets of arbitrary sizes (instead of just 2 points), then under
any distribution we can efficiently properly learn DNF using binary-valued queries from random
data.

Proof  We  take  any  set  of  examples  and  ask  the  oracle  the  number  of  terms  all  examples  in  the 
set  have  in  common.  Let  S  be  the  query  set.  The  idea  is  to  greedily  add  the  examples  to  S  while 
keeping  some  terms  in  common. 

Algorithm: 

0. Input: dataset $D$
1. Initialize $S$ to be an empty set
2. Do {
3.     Do {
4.         $r_{\max} \leftarrow 0$
5.         For each example $x$ in the dataset $D$
6.             add $x$ to the set $S$
7.             query the combined set $S$, and let $r$ = Oracle($S$), $r_{\max} \leftarrow \max\{r_{\max}, r\}$
8.             If $r = 0$, remove $x$ from $S$, and otherwise leave it in $S$ and remove $x$ from $D$
9.     } Until ($r_{\max} = 0$)
10.     Learn a “most-specific” conjunction from $S$ and add that term to the hypothesis DNF
11.     Reset $S$ to empty set
12. } Until ($|D| = 0$)

Each time we add a term to the DNF, the examples in $S$ satisfy some term in the target DNF,
because we only add each example if by adding it $S$ still has at least one term in common. So the
“most-specific” conjunction consistent with $S$ (i.e., the one with most literals in it, still labeling
all of $S$ positive) will not misclassify any negative point as positive. Since whenever we add a
new term, there were no additional examples in $D$ that could have satisfied a term in common
with the examples in $S$, after adding the term we have removed from $D$ all examples that satisfy
the term $S$ has in common. Therefore, the number of terms in our learnt DNF is at most the
number of terms $T$ in the true DNF. If the total number of examples is $\gg nT$ (and say $T$ is
$\mathrm{poly}(n)$), this yields a DNF that has at most $T$ terms and correctly labels a $\mathrm{poly}(n) \gg nT$
sized dataset. Since the training dataset size is much larger than the size of the classifier, by the
Occam bound, the learnt DNF will have small generalization error.

□ 
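
The greedy procedure can be sketched in Python as below. This is an illustrative sketch only: the set oracle is simulated from a known toy target, terms are encoded as dictionaries from bit index to required value, and the input is assumed to contain only the positively labeled examples (so that every example is eventually placed in some $S$ and the outer loop terminates). The function names are hypothetical.

```python
from itertools import product

def satisfies(term, x):
    return all(x[i] == b for i, b in term.items())

def make_set_oracle(terms):
    """Stand-in for Oracle(S): number of target terms satisfied by every x in S."""
    def oracle(S):
        return sum(1 for t in terms if all(satisfies(t, x) for x in S))
    return oracle

def most_specific_conjunction(S):
    """Keep a literal for every bit position on which all examples in S agree."""
    first = S[0]
    return {i: first[i] for i in range(len(first))
            if all(x[i] == first[i] for x in S)}

def learn_dnf_set_queries(oracle, positives):
    D = list(positives)
    h = []
    while D:                                   # outer Do ... Until(|D| = 0)
        S = []
        while True:                            # inner Do ... Until(r_max = 0)
            r_max = 0
            for x in list(D):                  # iterate over a copy; D shrinks
                r = oracle(S + [x])
                r_max = max(r_max, r)
                if r > 0:                      # S + [x] still shares a term
                    S.append(x)
                    D.remove(x)
            if r_max == 0:
                break
        h.append(most_specific_conjunction(S))
    return h

target = [{0: 1, 1: 0}, {2: 1, 3: 1}]          # (x0 AND NOT x1) OR (x2 AND x3)
cube = list(product((0, 1), repeat=4))
positives = [x for x in cube if any(satisfies(t, x) for t in target)]
h = learn_dnf_set_queries(make_set_oracle(target), positives)
```

Each learned conjunction is at least as specific as the target term shared by its set $S$, which is why no negative point is labeled positive.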


4.6  Learning  DNF  with  General  Queries:  Open  Questions 

•  Is it possible to efficiently learn an arbitrary DNF from random data under arbitrary distributions
with numerical-valued queries?

•  Is it possible to efficiently learn a DNF with $O(1)$ terms from random data under arbitrary
distributions with binary-valued queries?

•  Is it possible to efficiently learn a monotone DNF from random data under a uniform
distribution with numerical-valued queries? If so, what about binary-valued queries?


4.7  Generalizations 

4.7.1  Learning  Unions  of  Halfspaces 

Several  of  the  above  results  generalize  nicely  to  the  more  general  problem  of  learning  unions  of 
halfspaces.  Specifically,  the  queries  are  of  the  type  “do  these  two  examples  satisfy  a  halfspace  in 
common?” or “how many halfspaces do these two examples satisfy in common?” The generalized
forms of Theorem 4.19 and Lemma 4.10 follow by the exact same arguments. In each case,
the algorithm finds sets of examples that satisfy some halfspace, such that none of the remaining
examples satisfy that halfspace, so for each such set we simply find a linear separator to separate
those examples from the rest, and take their union to form our final classifier. A sufficiently
large ($\mathrm{poly}(n, 1/\epsilon)$-sized) set suffices to guarantee this works. It is not so clear how to generalize
Theorem  4.7,  since  it  is  not  clear  how  to  use  the  sets  of  examples  with  the  common  profiles  to 
learn  the  halfspaces.  The  generalized  version  of  Theorem  4.6  actually  follows  from  the  result 
below  on  learning  Voronoi  diagrams.  The  generalized  version  of  Theorem  4.18  is  simple,  since 
it  is  even  known  that  labeled  data  plus  membership  queries  are  sufficient. 

4.7.2  Learning  Voronoi  with  General  Queries 

Consider  the  space  of  Voronoi  diagrams  (vector  quantizers);  specifically,  the  target  function  is 
constant  within  each  cell  of  the  Voronoi  diagram,  and  there  are  poly(n)  such  cells  for  a  given 
target  function.  We  define  a  “same  cell”  query  as  asking,  for  a  pair  of  examples  x  and  y,  whether 
x  and  y  occur  in  the  same  cell  of  the  target  function.  With  this  type  of  query,  we  can  efficiently 
properly  learn  Voronoi  partitions  from  random  data,  under  arbitrary  distributions.  To  prove  this, 
we  simply  group  the  examples  in  a  sufficiently  large  sample  into  equivalence  classes  based  on
these  same-cell  queries.  For  each  pair  of  such  equivalence  classes,  we  find  a  linear  separator  that
separates  them.  For  each  test  point,  we  evaluate  these  linear  separators,  which  thereby  associates 
the  test  point  with  one  of  the  equivalence  classes  from  the  training  data,  and  we  predict  as  a  label 
for  that  point  the  label  associated  with  that  equivalence  class.  If  we  have  a  sufficiently  large 
training  set,  then  there  is  only  a  small  probability  the  test  point  gets  placed  into  a  different  set  of 
points  from  those  in  its  own  cell.


Chapter  5


Bayesian  Active  Learning  with  Arbitrary 
Binary  Valued  Queries 


Abstract 

¹We investigate the minimum expected number of bits sufficient to encode a random variable $X$
while still being able to recover an approximation of $X$ with expected distance from $X$ at most
$D$: that is, the optimal rate at distortion $D$, in a one-shot coding setting. We find this quantity is
related to the entropy of a Voronoi partition of the values of $X$ based on a maximal $D$-packing.


5.1  Introduction 

In  this  work,  we  study  the  fundamental  complexity  of  lossy  coding.  We  are  particularly  interested 
in  identifying  a  key  quantity  that  characterizes  the  expected  number  of  bits  (called  the  rate ) 
required  to  encode  a  random  variable  so  that  we  may  recover  an  approximation  within  expected 
distance  D  (called  the  distortion).  This  topic  is  a  generalization  of  the  well-known  analysis  of 
exact coding by Shannon [Shannon, 1948], where it is known that the optimal expected number
of bits is precisely characterized by the entropy. There are many problems in which exact coding
is not practical or not possible, so that lossy coding becomes necessary: particularly for random
variables taking values in uncountably infinite spaces. The topic of code lengths for lossy coding
is interesting, both for its direct applications to compression, and also as a general setting in
which to derive lower bounds for specializations of the setting.

¹Joint work with Jaime Carbonell and Steve Hanneke.

There is much existing work on lossy binary codes. In the present work, we are interested
in a “one-shot” analysis of lossy coding [Kieffer, 1993], in which we wish to encode a single
random variable, in contrast to the analysis of “asymptotic” source coding [Cover and Thomas,
2006], in which one wishes to simultaneously encode a sequence of random variables. Of particular
relevance to the one-shot coding problem is the analysis of quantization methods that
balance distortion with entropy [Gersho, 1979, Kieffer, 1993, Zador, 1982]. In particular, it is
now well-known that this approach can yield codes that respect a distortion constraint while nearly
minimizing the rate, so that there are near-optimal codes of this type [Kieffer, 1993]. Thus, we
have an alternative way to think of the optimal rate, in terms of the rate of the best
distortion-constrained quantization method. While this is interesting, in that it allows us to restrict our focus
in the design of effective coding techniques, it is not as directly helpful if we wish to understand
the behavior of the optimal rate itself. That is, since we do not have an explicit description of the
optimal quantizer, it may often be difficult to study the behavior of its rate under various interesting
conditions. There exist classic results lower bounding the achievable rates, most notably the
famous Shannon lower bound [Shannon, 1959], which under certain restrictions on the source
and the distortion metric, is known to be fairly tight in the asymptotic analysis of source coding
[Linder and Zamir, 1994]. However, there are few general results explicitly and tightly characterizing
the (non-asymptotic) optimal rates for one-shot coding. In particular, to our knowledge,
only a few special-case calculations of the exact value of this optimal rate have been explicitly
carried out, such as vectors of independent Bernoulli or Gaussian random variables [Cover and
Thomas, 2006].

Below, we discuss a particular distortion-constrained quantizer, based on a Voronoi partition
induced by a maximal packing. We are interested in the entropy of this quantizer, as a quantity
used to characterize the optimal rate for codes of a given distortion. While it is clear that this
entropy upper bounds the optimal rate, as this is the case for any distortion-constrained quantizer
[Kieffer, 1993], the novelty of our analysis lies in noting the remarkable fact that the entropy
of any quantizer constructed in this way also lower bounds the optimal rate. In particular, this
provides a method for approximately calculating the optimal rate without the need to optimize
over all possible quantizers. Our result is general, in that it applies to an arbitrary distribution
and an arbitrary distortion measure from a general class of finite-dimensional pseudo-metrics.
This generality is noteworthy, as it leads to interesting applications in statistical learning theory,
which we describe below.


Our analysis is closely related to various notions that arise in the study of $\epsilon$-entropy [Posner
and Rodemich, 1971, Posner, Rodemich, and Rumsey, Jr., 1967], in that we are concerned with
the entropy of a Voronoi partition induced by an $\epsilon$-cover. The notion of $\epsilon$-entropy has been
related to the optimal rates for a given distortion (under a slightly different model than studied
here) [Posner and Rodemich, 1971, Posner, Rodemich, and Rumsey, Jr., 1967]. However, there
are some important distinctions, perhaps the most significant of which is that calculating the
$\epsilon$-entropy requires a prohibitive optimization of the entropy over all $\epsilon$-covers; in contrast, the
entropy term in our analysis can be calculated based on any maximal $\epsilon$-packing (which is a
particular type of $\epsilon$-cover). Maximal $\epsilon$-packings are easy to construct by greedily adding arbitrary
new elements to the packing that are $\epsilon$-far from all elements already added; thus, there is always
a straightforward algorithmic approach to applying our results.


5.2  Definitions


We suppose $\mathcal{X}^*$ is an arbitrary (nonempty) set, equipped with a separable pseudo-metric
$\rho : \mathcal{X}^* \times \mathcal{X}^* \to [0, \infty)$.² We suppose $\mathcal{X}^*$ is accompanied by its Borel $\sigma$-algebra induced by $\rho$. There
is additionally a (nonempty, measurable) set $\mathcal{X} \subseteq \mathcal{X}^*$, and we denote by $\bar{\rho} = \sup_{h_1, h_2 \in \mathcal{X}} \rho(h_1, h_2)$.
Finally, there is a probability measure $\pi$ with $\pi(\mathcal{X}) = 1$, and an $\mathcal{X}$-valued random variable $X$
with distribution $\pi$, referred to here as the “target.” As the distribution is essentially arbitrary, the
results below will hold for any $\pi$.

A code is a pair of (measurable) functions $(\phi, \psi)$. The encoder, $\phi$, maps any element $x \in \mathcal{X}$
to a binary sequence $\phi(x) \in \bigcup_{q=0}^{\infty} \{0,1\}^q$ (the codeword). The decoder, $\psi$, maps any element
$c \in \bigcup_{q=0}^{\infty} \{0,1\}^q$ to an element $\psi(c) \in \mathcal{X}^*$. For any $q \in \{0, 1, \ldots\}$ and $c \in \{0,1\}^q$, let $|c| = q$
denote the length of $c$. A prefix-free code is any code $(\phi, \psi)$ such that no $x_1, x_2 \in \mathcal{X}$ have
$c^{(1)} = \phi(x_1)$ and $c^{(2)} = \phi(x_2)$ with $|c^{(1)}| < |c^{(2)}|$ but $\forall i \le |c^{(1)}|$, $c^{(2)}_i = c^{(1)}_i$: that is, no codeword is
a prefix of another (longer) codeword. Let $PF$ denote the set of all prefix-free binary codes.

Here, we consider a setting where the code $(\phi, \psi)$ may be lossy, in the sense that for some
values of $x \in \mathcal{X}$, $\rho(\psi(\phi(x)), x) > 0$. Our objective is to design the code to have small expected
loss (in the $\rho$ sense), while maintaining as small an expected codeword length as possible.
Formally, we have the following definition, which essentially describes a notion of optimality
for a lossy code.

Definition 5.1. For any $D > 0$, define the optimal rate at distortion $D$

$$R(D) = \inf\left\{ E\big[|\phi(X)|\big] : (\phi, \psi) \in PF \text{ with } E\big[\rho(\psi(\phi(X)), X)\big] \le D \right\},$$

where the random variable in both expectations is $X \sim \pi$.

For our analysis, we will require a notion of dimensionality for the pseudo-metric $\rho$. For this,
we adopt the well-known doubling dimension [Gupta, Krauthgamer, and Lee, 2003].

²The set $\mathcal{X}^*$ will not play any significant role in the analysis, except to allow for improper learning scenarios to
be a special case of our setting.

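Definition 5.2. Define the doubling dimension $d$ as the smallest value $d$ such that, for any $x \in \mathcal{X}$,
and any $\epsilon > 0$, the size of the minimal $\epsilon/2$-cover of the $\epsilon$-radius ball around $x$ is at most $2^d$.

That is, for any $x \in \mathcal{X}$ and $\epsilon > 0$, there exists a set $\{x_i\}_{i=1}^{2^d}$ of $2^d$ elements of $\mathcal{X}$ such that

$$\{x' \in \mathcal{X} : \rho(x', x) \le \epsilon\} \subseteq \bigcup_{i=1}^{2^d} \{x' \in \mathcal{X} : \rho(x', x_i) \le \epsilon/2\}.$$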

Note that, as defined here, $d$ is a constant (i.e., has no dependence on the $x$ or $\epsilon$ in its definition).
In the analysis below, we will always assume $d < \infty$. The doubling dimension has been
studied for a variety of spaces, originally by Gupta, Krauthgamer, & Lee [Gupta, Krauthgamer,
and Lee, 2003], and subsequently by many others. In particular, Bshouty, Li, & Long [Bshouty,
Li, and Long, 2009] discuss the doubling dimension of spaces $\mathcal{X}$ of binary classifiers, in the
context of statistical learning theory.

5.2.1  Definition  of  Packing  Entropy 

Our main result concerns the relation between the optimal rate at a given distortion and the
entropy of a certain quantizer. We now turn to defining this latter quantity.

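Definition 5.3. For any $D > 0$, define $\mathcal{Y}(D) \subseteq \mathcal{X}$ as a maximal $D$-packing of $\mathcal{X}$. That is,
for all distinct $x_1, x_2 \in \mathcal{Y}(D)$, $\rho(x_1, x_2) \ge D$, and $\forall x \in \mathcal{X} \setminus \mathcal{Y}(D)$, $\min_{x' \in \mathcal{Y}(D)} \rho(x, x') < D$.

For our purposes, if multiple maximal $D$-packings are possible, we can choose to define
$\mathcal{Y}(D)$ arbitrarily from among these; the results below hold for any such choice. Recall that any
maximal $D$-packing of $\mathcal{X}$ is also a $D$-cover of $\mathcal{X}$, since otherwise we would be able to add to
$\mathcal{Y}(D)$ the $x \in \mathcal{X}$ that escapes the cover. That is, $\forall x \in \mathcal{X}$, $\exists y \in \mathcal{Y}(D)$ s.t. $\rho(x, y) < D$.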

Next we define a complexity measure, a type of entropy, which serves as our primary quantity
of interest in the analysis of $R(D)$. It is specified in terms of a partition induced by $\mathcal{Y}(D)$, defined
as follows.

Definition 5.4. For any $D > 0$, define

$$\mathcal{Q}(D) = \left\{ \left\{ x \in \mathcal{X} : z = \operatorname*{argmin}_{y \in \mathcal{Y}(D)} \rho(x, y) \right\} : z \in \mathcal{Y}(D) \right\},$$

where we break ties in the argmin arbitrarily but consistently (e.g., based on a predefined preference
ordering of $\mathcal{Y}(D)$).

Definition 5.5. For any finite (or countable) partition $\mathcal{S}$ of $\mathcal{X}$ into measurable regions (subsets),
define the entropy of $\mathcal{S}$

$$\mathcal{H}(\mathcal{S}) = -\sum_{S \in \mathcal{S}} \pi(S) \log_2 \pi(S).$$

In particular, we will be interested in the quantity $\mathcal{H}(\mathcal{Q}(D))$ in the analysis below.
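
As a concrete illustration of Definitions 5.3 through 5.5, the sketch below builds a maximal $D$-packing greedily, assigns each point to its nearest representative, and computes the entropy of the resulting partition. It assumes a finite set of points with given probabilities (standing in for $\pi$) and a user-supplied distance function; the function names are hypothetical and the finite setting is a simplification of the general one in the text.

```python
import math

def greedy_packing(points, D, rho):
    """Maximal D-packing of a finite set: keep x iff it is at distance >= D
    from every point kept so far."""
    Y = []
    for x in points:
        if all(rho(x, y) >= D for y in Y):
            Y.append(x)
    return Y

def partition_entropy(points, probs, Y, rho):
    """Entropy (base 2) of the Voronoi partition induced by Y.
    Ties in the argmin are broken by the fixed ordering of Y."""
    mass = [0.0] * len(Y)
    for x, p in zip(points, probs):
        z = min(range(len(Y)), key=lambda k: rho(x, Y[k]))  # first minimizer wins
        mass[z] += p
    return -sum(m * math.log2(m) for m in mass if m > 0)

def rho(a, b):
    return abs(a - b)

points = list(range(8))                 # uniform distribution on {0, ..., 7}
probs = [1 / 8] * 8
Y = greedy_packing(points, 2, rho)      # keeps 0, 2, 4, 6
H = partition_entropy(points, probs, Y, rho)
```

Here each of the four cells has probability 1/4, so the entropy is 2 bits. Any other maximal packing could be substituted, since the results are stated for an arbitrary choice of $\mathcal{Y}(D)$.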

5.3  Main  Result 

Our main result can be summarized as follows. Note that, since we took the distribution $\pi$ to be
arbitrary in the above definitions, this result holds for any given $\pi$.

Theorem 5.6. If $d < \infty$ and $\bar{\rho} < \infty$, then there is a constant $c = O(d)$ such that $\forall D \in (0, \bar{\rho}/2)$,

$$\mathcal{H}\big(\mathcal{Q}(D \log_2(\bar{\rho}/D))\big) - c \le R(D) \le \mathcal{H}(\mathcal{Q}(D)) + 1.$$

It should not be surprising that entropy terms play a key role in this result, as the entropy is
essential to the analysis of exact coding [Shannon, 1948]. Furthermore, $R(D)$ is tightly characterized
by the minimum achievable entropy among all quantizers of distortion at most $D$ [Kieffer,
1993]. The interesting aspect of Theorem 5.6 is that we can explicitly describe a particular quantizer
with near-optimal rate, and its entropy can be explicitly calculated for a variety of scenarios
$(\mathcal{X}, \rho, \pi)$. As for the behavior of $R(D)$ within the range between the upper and lower bounds
of Theorem 5.6, we should expect the upper bound to be tight when high-probability subsets of
the regions in $\mathcal{Q}(D)$ are point-wise well-separated, while $R(D)$ may be much smaller (perhaps
closer to the lower bound) when this is violated to a large degree, for reasons described in the
proof below.

Figure 5.1: Plots of $\mathcal{H}(\mathcal{Q}(D))$ as a function of $1/D$, for various distributions $\pi$ on $\mathcal{X} = \mathbb{R}$.

Although this result is stated for bounded pseudo-metrics $\rho$, it also has implications for unbounded
$\rho$. In particular, the proof of the upper bound holds as-is for unbounded $\rho$. Furthermore,
we can always use this lower bound to construct a lower bound for unbounded $\rho$, simply restricting
to a bounded subset of $\mathcal{X}$ with constant probability and calculating the lower bound for that
region. For instance, to get a lower bound for $\pi$ as a Gaussian distribution on $\mathbb{R}$, we might note
that $\pi([-1/2, 1/2])$ times the expected loss under the conditional $\pi(\cdot \mid [-1/2, 1/2])$ lower bounds
the total expected loss. Thus, calculating the lower bound of Theorem 5.6 under the conditional
$\pi(\cdot \mid [-1/2, 1/2])$ while replacing $D$ with $D/\pi([-1/2, 1/2])$ provides a lower bound on $R(D)$.

To get a feel for the behavior of $\mathcal{H}(\mathcal{Q}(D))$, we have plotted it as a function of $1/D$ for several
distributions, in Figure 5.1.

5.4  Proof  of  Theorem  5.6 

We  first  state  a  lemma,  due  to  Gupta,  Krauthgamer,  &  Lee  [Gupta,  Krauthgamer,  and  Lee,  2003], 
which  will  be  useful  in  the  proof  of  Theorem  5.6. 

Lemma 5.7. [Gupta, Krauthgamer, and Lee, 2003] For any $\gamma \in (0, \infty)$, $\delta \in [\gamma, \infty)$, and $x \in \mathcal{X}$,

$$\left|\{x' \in \mathcal{Y}(\gamma) : \rho(x', x) \le \delta\}\right| \le \left(\frac{4\delta}{\gamma}\right)^d.$$

In particular, note that this lemma implies that the minimum of $\rho(x, y)$ over $y \in \mathcal{Y}(D)$ is
always achieved in Definition 5.4, so that $\mathcal{Q}(D)$ is well-defined.

We are now ready for the proof of Theorem 5.6.


Proof of Theorem 5.6. Throughout the proof, we will consider a set-valued random quantity
$\mathcal{Q}_D(X)$ with value equal to the set in $\mathcal{Q}(D)$ containing $X$, and a corresponding $\mathcal{X}$-valued random
quantity $Y_D(X)$ with value equal to the sole point in $\mathcal{Q}_D(X) \cap \mathcal{Y}(D)$: that is, the target’s nearest
representative in the $D$-packing. Note that, by Lemma 5.7, $|\mathcal{Y}(D)| < \infty$ for all $D \in (0, 1)$. We
will also adopt the usual notation for entropy (e.g., $\mathcal{H}(\mathcal{Q}_D(X))$) and conditional entropy (e.g.,
$\mathcal{H}(\mathcal{Q}_D(X) \mid Z)$) [Cover and Thomas, 2006], both in base 2.

To establish the upper bound, we simply take $\phi$ as the Huffman code for the random quantity
$\mathcal{Q}_D(X)$ [Cover and Thomas, 2006, Huffman, 1952]. It is well-known that the expected length
of a Huffman code for $\mathcal{Q}_D(X)$ is at most $\mathcal{H}(\mathcal{Q}_D(X)) + 1$ (in fact, is equal to $\mathcal{H}(\mathcal{Q}_D(X))$ when
the probabilities are powers of 2) [Cover and Thomas, 2006, Huffman, 1952], and each possible
value of $\mathcal{Q}_D(X)$ is assigned a unique codeword so that we can perfectly recover $\mathcal{Q}_D(X)$ (and thus
also $Y_D(X)$) based on $\phi(X)$. In particular, define $\psi(\phi(X)) = Y_D(X)$. Finally, recall that any
maximal $D$-packing is also a $D$-cover. Thus, since every element of the set $\mathcal{Q}_D(X)$ has $Y_D(X)$ as
its closest representative in $\mathcal{Y}(D)$, we must have $\rho(X, \psi(\phi(X))) = \rho(X, Y_D(X)) < D$. In fact,
as this proof never relies on $\bar{\rho} < \infty$, this establishes the upper bound even in the case $\bar{\rho} = \infty$.
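
The Huffman step can be checked numerically. The following minimal sketch computes the codeword lengths of a binary Huffman code for a given list of cell probabilities and compares the expected length with the entropy. The function names are hypothetical, and the probabilities stand in for the values $\pi(Q)$, $Q \in \mathcal{Q}(D)$.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, k, [k]) for k, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for k in s1 + s2:               # each merge adds one bit to these symbols
            lengths[k] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_length(probs):
    return sum(p * l for p, l in zip(probs, huffman_lengths(probs)))

dyadic = [0.5, 0.25, 0.125, 0.125]      # powers of 2: expected length equals entropy
general = [0.4, 0.3, 0.2, 0.1]          # expected length lies in [H, H + 1)
```

For the dyadic distribution the expected length equals the entropy exactly, matching the parenthetical remark in the proof; for the second distribution it falls strictly between the entropy and the entropy plus one.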

The proof of the lower bound is somewhat more involved, though the overall idea is simple
enough. Essentially, the lower bound would be straightforward if the regions of $\mathcal{Q}(D \log_2(\bar{\rho}/D))$
were separated by some distance, since we could make an argument based on Fano’s inequality
to say that since any $\hat{X} = \psi(\phi(X))$ is “close” to at most one region, the expected distance
from $X$ is at least as large as half this inter-region distance times a quantity proportional to the
conditional entropy $\mathcal{H}(\mathcal{Q}_D(X) \mid \phi(X))$, so that $\mathcal{H}(\phi(X))$ can be related to $\mathcal{H}(\mathcal{Q}_D(X))$.

However, the general case is not always so simple, as the regions can generally be quite close
to each other (even adjacent), so that it is possible for $\hat{X}$ to be close to multiple regions. Thus, the
proof will first “color” the regions of $\mathcal{Q}(D \log_2(\bar{\rho}/D))$ in a way that guarantees no two regions of
the same color are within distance $D \log_2(\bar{\rho}/D)$ of each other. Then we apply the above simple
argument for each color separately (i.e., lower bounding the expected distance from $X$ under the
conditional given the color of $\mathcal{Q}_{D \log_2(\bar{\rho}/D)}(X)$ by a function of the conditional entropy under the
conditional), and average over the colors to get a global lower bound. The details follow.

Fix any $D \in (0, \bar{\rho}/2)$, and for brevity let $\alpha = D \log_2(\bar{\rho}/D)$. We suppose $(\phi, \psi)$ is some
prefix-free binary code.

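Define a function $\mathcal{K} : \mathcal{Q}(\alpha) \to \mathbb{N}$ such that for all distinct $Q_1, Q_2 \in \mathcal{Q}(\alpha)$,

$$\mathcal{K}(Q_1) = \mathcal{K}(Q_2) \implies \inf_{x_1 \in Q_1, x_2 \in Q_2} \rho(x_1, x_2) \ge \alpha, \qquad (5.1)$$

and suppose $\mathcal{K}$ has minimum $\mathcal{H}(\mathcal{K}(\mathcal{Q}_\alpha(X)))$ subject to (5.1). We will refer to $\mathcal{K}(Q)$ as the color
of $Q$.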

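Now we are ready to bound the expected distance from $X$. Let $\hat{X} = \psi(\phi(X))$, and let
$\mathcal{Q}_\alpha(\hat{X}; K)$ denote the set $Q \in \mathcal{Q}(\alpha)$ having $\mathcal{K}(Q) = K$ with smallest $\inf_{x \in Q} \rho(x, \hat{X})$ (breaking
ties arbitrarily). We know

$$E\big[\rho(\hat{X}, X)\big] = E\Big[E\big[\rho(\hat{X}, X) \,\big|\, \mathcal{K}(\mathcal{Q}_\alpha(X))\big]\Big]. \qquad (5.2)$$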


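Furthermore, by (5.1) and a triangle inequality, we know no $\hat{X}$ can be closer than $\alpha/2$ to more
than one $Q \in \mathcal{Q}(\alpha)$ of a given color. Therefore,

$$E\big[\rho(\hat{X}, X) \,\big|\, \mathcal{K}(\mathcal{Q}_\alpha(X))\big] \ge \frac{\alpha}{2} P\Big(\mathcal{Q}_\alpha\big(\hat{X}; \mathcal{K}(\mathcal{Q}_\alpha(X))\big) \ne \mathcal{Q}_\alpha(X) \,\Big|\, \mathcal{K}(\mathcal{Q}_\alpha(X))\Big). \qquad (5.3)$$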

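By Fano’s inequality, we have

$$E\Big[P\Big(\mathcal{Q}_\alpha\big(\hat{X}; \mathcal{K}(\mathcal{Q}_\alpha(X))\big) \ne \mathcal{Q}_\alpha(X) \,\Big|\, \mathcal{K}(\mathcal{Q}_\alpha(X))\Big)\Big] \ge \frac{\mathcal{H}\big(\mathcal{Q}_\alpha(X) \mid \phi(X), \mathcal{K}(\mathcal{Q}_\alpha(X))\big) - 1}{\log_2 |\mathcal{Y}(\alpha)|}. \qquad (5.4)$$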


It is generally true that, for a prefix-free binary code $\phi(X)$, $\phi(X)$ is a lossless prefix-free binary code for itself (i.e., with the identity decoder), so that the classic entropy lower bound on average code length [Cover and Thomas, 2006, Shannon, 1948] implies $\mathcal{H}(\phi(X)) \leq \mathbb{E}[|\phi(X)|]$. Also, recalling that $\mathcal{Y}(\alpha)$ is maximal, and therefore also an $\alpha$-cover, we have that any $Q_1, Q_2 \in \mathcal{Q}(\alpha)$ with $\inf_{x_1 \in Q_1,\, x_2 \in Q_2} \rho(x_1, x_2) \leq \alpha$ have $\rho(Y_\alpha(x_1), Y_\alpha(x_2)) \leq 3\alpha$ (by a triangle inequality). Therefore, Lemma 5.7 implies that, for any given $Q_1 \in \mathcal{Q}(\alpha)$, there are at most $12^d$ sets $Q_2 \in \mathcal{Q}(\alpha)$ with $\inf_{x_1 \in Q_1,\, x_2 \in Q_2} \rho(x_1, x_2) \leq \alpha$. We therefore know there exists a function $\mathcal{K}' : \mathcal{Q}(\alpha) \to \mathbb{N}$ satisfying (5.1) such that $\max_{Q \in \mathcal{Q}(\alpha)} \mathcal{K}'(Q) \leq 12^d$ (i.e., we need at most $12^d$ colors to satisfy (5.1)). That is, if we consider coloring the sets $Q \in \mathcal{Q}(\alpha)$ sequentially, for any given $Q_1$ not yet colored, there are $< 12^d$ sets $Q_2 \in \mathcal{Q}(\alpha) \setminus \{Q_1\}$ within $\alpha$ of it, so there must exist a color among $\{1, \ldots, 12^d\}$ not used by any of them, and we can choose that for $\mathcal{K}'(Q_1)$. In particular, by our choice of $\mathcal{K}$ to minimize $\mathcal{H}(\mathcal{K}(Q_\alpha(X)))$ subject to (5.1), this implies

\[
\mathcal{H}(\mathcal{K}(Q_\alpha(X))) \leq \mathcal{H}(\mathcal{K}'(Q_\alpha(X))) \leq \log_2(12^d) \leq 4d.
\]
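The sequential coloring argument can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the construction used in the proof: the names `greedy_coloring`, `in_conflict`, and the toy "cells on a line" example are hypothetical, and "in conflict" stands in for "within $\alpha$ of each other". The point it shows is that if every item conflicts with fewer than `num_colors` others, then a one-pass greedy assignment never runs out of colors.

```python
def greedy_coloring(items, in_conflict, num_colors):
    """Color items one at a time; conflicting items never share a color.

    Succeeds whenever every item conflicts with fewer than `num_colors` others.
    """
    color = {}
    for q in items:
        # Colors already taken by previously colored items in conflict with q.
        used = {color[r] for r in color if in_conflict(q, r)}
        # Some color in {1, ..., num_colors} is unused by those neighbors.
        color[q] = next(c for c in range(1, num_colors + 1) if c not in used)
    return color


# Hypothetical example: cells 0..19 on a line, in conflict when at most 2 apart,
# so each cell conflicts with at most 4 others and 5 colors suffice.
cells = list(range(20))
near = lambda i, j: i != j and abs(i - j) <= 2
coloring = greedy_coloring(cells, near, num_colors=5)
```

In the proof, the role of `num_colors` is played by $12^d$, which is what yields the $\log_2(12^d) \leq 4d$ bound on the entropy of the color.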

Thus, 


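\[
\begin{aligned}
\mathcal{H}(Q_\alpha(X) \mid \phi(X), \mathcal{K}(Q_\alpha(X)))
&= \mathcal{H}(Q_\alpha(X), \phi(X), \mathcal{K}(Q_\alpha(X))) - \mathcal{H}(\phi(X)) - \mathcal{H}(\mathcal{K}(Q_\alpha(X)) \mid \phi(X)) \\
&\geq \mathcal{H}(Q_\alpha(X)) - \mathcal{H}(\phi(X)) - \mathcal{H}(\mathcal{K}(Q_\alpha(X))) \\
&\geq \mathcal{H}(Q_\alpha(X)) - \mathbb{E}[|\phi(X)|] - 4d \\
&= \mathcal{H}(\mathcal{Q}(\alpha)) - \mathbb{E}[|\phi(X)|] - 4d.
\end{aligned} \tag{5.5}
\]

Thus, combining (5.2), (5.3), (5.4), and (5.5), we have

\[
\begin{aligned}
\mathbb{E}[\rho(\hat{X}, X)]
&\geq \frac{\alpha}{2} \cdot \frac{\mathcal{H}(\mathcal{Q}(\alpha)) - \mathbb{E}[|\phi(X)|] - 4d - 1}{\log_2 |\mathcal{Y}(\alpha)|} \\
&\geq \frac{\alpha}{2} \cdot \frac{\mathcal{H}(\mathcal{Q}(\alpha)) - \mathbb{E}[|\phi(X)|] - 4d - 1}{d \log_2(4\bar{\rho}/\alpha)},
\end{aligned}
\]

where the last inequality follows from Lemma 5.7.

Thus, for any code with

\[
\mathbb{E}[|\phi(X)|] < \mathcal{H}(\mathcal{Q}(\alpha)) - 4d - 1 - \frac{2d \log_2(4\bar{\rho}/D)}{\log_2(\bar{\rho}/D)},
\]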
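we have $\mathbb{E}[\rho(\hat{X}, X)] > D$, which implies

\[
R(D) \geq \mathcal{H}(\mathcal{Q}(\alpha)) - 4d - 1 - \frac{2d \log_2(4\bar{\rho}/D)}{\log_2(\bar{\rho}/D)}.
\]

Since $\log_2(4\bar{\rho}/D) / \log_2(\bar{\rho}/D) \leq 3$, we have

\[
R(D) \geq \mathcal{H}(\mathcal{Q}(\alpha)) - O(d).
\]

□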
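As a check on the last step of the proof above, the constant 3 can be verified directly. The following worked derivation is not part of the original argument; it uses only the assumption $D < \bar{\rho}/2$ from the start of the proof, and the symbol $\bar{\rho}$ is assumed here to denote the same bound on distances that appears in Lemma 5.7.

```latex
\[
\frac{\log_2(4\bar{\rho}/D)}{\log_2(\bar{\rho}/D)}
  = \frac{2 + \log_2(\bar{\rho}/D)}{\log_2(\bar{\rho}/D)}
  = 1 + \frac{2}{\log_2(\bar{\rho}/D)}
  \leq 1 + 2 = 3,
\]
% because D < \bar{\rho}/2 gives \log_2(\bar{\rho}/D) > 1.
% Substituting into the bound on R(D):
\[
R(D) \geq \mathcal{H}(\mathcal{Q}(\alpha)) - 4d - 1 - 2d \cdot 3
      = \mathcal{H}(\mathcal{Q}(\alpha)) - 10d - 1 .
\]
```

Under these assumptions the $O(d)$ term can be taken to be $10d + 1$.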


5.5  Application  to  Bayesian  Active  Learning 

As an example, in the special case of the problem of learning a binary classifier, as studied by [Haussler, Kearns, and Schapire, 1994a] and [Freund, Seung, Shamir, and Tishby, 1997], $\mathcal{X}^*$ is the set of all measurable classifiers $h : \mathcal{Z} \to \{-1, +1\}$, $\mathcal{X}$ is called the “concept space,” $X$ is called the “target function,” and $\rho(X_1, X_2) = \mathbb{P}(X_1(Z) \neq X_2(Z))$, where $Z$ is some $\mathcal{Z}$-valued random variable. In particular, $\rho(X_1, X)$ is called the “error rate” of $X_1$.

We may then discuss a learning protocol based on binary-valued queries. That is, we suppose some learning machine is able to pose yes/no questions to an oracle, and based on the responses it proposes a hypothesis $\hat{X}$. We may ask how many such yes/no questions the learning machine must pose (in expectation) before being able to produce a hypothesis $\hat{X} \in \mathcal{X}^*$ with $\mathbb{E}[\rho(\hat{X}, X)] \leq \varepsilon$, known as the query complexity.

If  the  learning  machine  is  allowed  to  pose  arbitrary  binary-valued  queries,  then  this  setting  is 
precisely  a  special  case  of  the  general  lossy  coding  problem  studied  above.  That  is,  any  learning 
machine that asks a sequence of yes/no questions before terminating and returning some $\hat{X} \in \mathcal{X}^*$ can be thought of as a binary decision tree (no = left, yes = right), with the returned $\hat{X}$ values stored in the leaf nodes. Transforming each root-to-leaf path in the decision tree into a codeword (left = 0, right = 1), we see that the algorithm corresponds to a prefix-free binary code. Conversely, given any prefix-free binary code, we can construct an algorithm based on sequentially asking queries of the form “what is the first bit in the codeword $\phi(X)$ for $X$?”, “what is the second bit in the codeword for $X$?”, etc., until we obtain a complete codeword, at which point we return the value that codeword decodes to. From this perspective, the query complexity is precisely $R(\varepsilon)$.
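The correspondence between question-asking learners and prefix-free codes can be made concrete with a short sketch. Everything named here (`Leaf`, `Node`, `run`, `codebook`, `is_prefix_free`, and the toy tree over a numeric target) is a hypothetical illustration rather than anything defined in the text: the tree stores a yes/no question at each internal node and a hypothesis at each leaf, and reading off the answers along a root-to-leaf path (no = 0, yes = 1) gives the codeword.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    hypothesis: float                   # value returned when questioning stops


@dataclass
class Node:
    question: Callable[[float], bool]   # yes/no query about the target
    no: "Tree"                          # left child  (answer "no",  bit 0)
    yes: "Tree"                         # right child (answer "yes", bit 1)


Tree = Union[Leaf, Node]


def run(tree, target):
    """Ask questions until a leaf is reached; return (codeword, hypothesis)."""
    bits = ""
    while isinstance(tree, Node):
        answer = tree.question(target)
        bits += "1" if answer else "0"
        tree = tree.yes if answer else tree.no
    return bits, tree.hypothesis


def codebook(tree, prefix=""):
    """Map each root-to-leaf path (as a bit string) to the stored hypothesis."""
    if isinstance(tree, Leaf):
        return {prefix: tree.hypothesis}
    return {**codebook(tree.no, prefix + "0"), **codebook(tree.yes, prefix + "1")}


def is_prefix_free(words):
    """Check that no codeword is a proper prefix of another."""
    ws = sorted(words)
    return all(not b.startswith(a) for a, b in zip(ws, ws[1:]))


# Hypothetical learner for a target in [0, 1): at most two questions.
tree = Node(lambda x: x >= 0.5,
            no=Node(lambda x: x >= 0.25, no=Leaf(0.125), yes=Leaf(0.375)),
            yes=Leaf(0.75))
```

Because codewords are exactly the root-to-leaf paths, no codeword can be a prefix of another, and the number of questions asked for a given target equals the length of its codeword.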

This  general  problem  of  learning  with  arbitrary  binary-valued  queries  was  studied  previously 
by  Kulkarni,  Mitter,  &  Tsitsiklis  [Kulkarni,  Mitter,  and  Tsitsiklis,  1993],  in  a  minimax  analysis 
(studying the worst-case value of $X$). In particular, they find that for a given distribution for $Z$, the worst-case query complexity is essentially characterized by $\log |\mathcal{Y}(\varepsilon)|$. The techniques employed are far more general than the classifier-learning problem, and in fact apply to any pseudo-metric space. Thus, we can abstractly think of their work as a minimax analysis
of  lossy  coding. 
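To give a feel for the quantity $\log |\mathcal{Y}(\varepsilon)|$, the sketch below greedily builds a maximal $\varepsilon$-separated set for a toy pseudo-metric space. It rests on assumptions that are not taken from the text: the function name `greedy_packing`, the convention that "separated" means pairwise distance at least $\varepsilon$, and the example of threshold classifiers under a uniform $Z$ on $[0, 1]$, where two thresholds $s$ and $t$ disagree with probability $|s - t|$. Exact fractions are used to avoid floating-point ties.

```python
import math
from fractions import Fraction


def greedy_packing(points, rho, eps):
    """Greedily pick points that are pairwise at least eps apart under rho."""
    chosen = []
    for p in points:
        if all(rho(p, q) >= eps for q in chosen):
            chosen.append(p)
    return chosen


# Thresholds t on a grid; under uniform Z on [0, 1], the classifiers
# x -> sign(x - s) and x -> sign(x - t) disagree with probability |s - t|.
thresholds = [Fraction(k, 100) for k in range(101)]
rho = lambda s, t: abs(s - t)
packing = greedy_packing(thresholds, rho, Fraction(1, 10))
bits = math.log2(len(packing))   # log2 of the packing size
```

Any point the greedy pass skips lies within $\varepsilon$ of a chosen point, which is why a maximal separated set is also a cover.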

In  addition  to  being  quite  interesting  in  their  own  right,  the  results  of  Kulkarni,  Mitter,  & 
Tsitsiklis  [Kulkarni,  Mitter,  and  Tsitsiklis,  1993]  have  played  a  significant  role  in  the  recent 
developments  in  active  learning  with  label  request  queries  for  binary  classification  [Dasgupta, 
2005, Hanneke, 2007a,b], in which the learning machine may only ask questions of the form, “What is the value $X(z)$?” for certain values $z \in \mathcal{Z}$. Since label requests can be viewed as a type of binary-valued query, the number of label requests necessary for learning is naturally lower bounded by the number of arbitrary binary-valued queries necessary for learning. We therefore always expect to see some term relating to $\log |\mathcal{Y}(\varepsilon)|$ in any minimax query complexity results for active learning with label requests (though this factor is typically represented by its upper bound: $\propto V \cdot \log(1/\varepsilon)$, where $V$ is the VC dimension).

Similarly to how the work of Kulkarni, Mitter, & Tsitsiklis [Kulkarni, Mitter, and Tsitsiklis, 1993] can be used to argue that $\log |\mathcal{Y}(\varepsilon)|$ is a lower bound on the minimax query complexity of active learning with label requests, Theorem 5.6 can be used to argue that $\mathcal{H}(\mathcal{Q}(\varepsilon \log_2(1/\varepsilon))) - O(d)$ is a lower bound on the query complexity of learning relative to a given distribution for $X$ (called a prior, in the language of Bayesian statistics), rather than the worst-case value of $X$. Furthermore, as with [Kulkarni, Mitter, and Tsitsiklis, 1993], this lower bound remains valid for learning with label requests, since label requests are a type of binary-valued query. Thus, we should expect a term related to $\mathcal{H}(\mathcal{Q}(\varepsilon))$ or $\mathcal{H}(\mathcal{Q}(\varepsilon \log_2(1/\varepsilon)))$ to appear in any tight analysis of the query complexity of Bayesian learning with label requests.

5.6  Open  Problems 

In our present context, there are several interesting questions, such as whether the $\log(\bar{\rho}/D)$ factor in the entropy argument of the lower bound can be removed, whether the additive constant in the lower bound might be improved, and in particular whether a similar result might be obtained without assuming $d < \infty$ (e.g., in the statistical learning special case, by making a VC class assumption instead).
Chapter 6


The  Sample  Complexity  of  Self-Verifying 
Bayesian  Active  Learning 


Abstract 

We prove that access to a prior distribution over target functions can dramatically improve the
sample  complexity  of  self-terminating  active  learning  algorithms,  so  that  it  is  always  better  than 
the  known  results  for  prior-dependent  passive  learning.  In  particular,  this  is  in  stark  contrast  to 
the  analysis  of  prior-independent  algorithms,  where  there  are  simple  known  learning  problems 
for  which  no  self-terminating  algorithm  can  provide  this  guarantee  for  all  priors. 


6.1  Introduction  and  Background 

Active learning is a powerful form of supervised machine learning characterized by interaction between the learning algorithm and supervisor during the learning process. In this work, we consider a variant known as pool-based active learning, in which a learning algorithm is given access to a (typically very large) collection of unlabeled examples, and is able to select any of those examples, request the supervisor to label it (in agreement with the target concept), then after receiving the label, select another example from the pool, etc. This sequential label-requesting process continues until some halting criterion is reached, at which point the algorithm outputs a function, and the objective is for this function to closely approximate the (unknown) target concept in the future. The primary motivation behind pool-based active learning is that, often, unlabeled examples are inexpensive and available in abundance, while annotating those examples can be costly or time-consuming; as such, we often wish to select only the informative examples to be labeled, thus reducing information-redundancy to some extent, compared to the baseline of selecting the examples to be labeled uniformly at random from the pool (passive learning).

1 Joint work with Jaime Carbonell and Steve Hanneke.

There  has  recently  been  an  explosion  of  fascinating  theoretical  results  on  the  advantages  of 
this type of active learning, compared to passive learning, in terms of the number of labels required to obtain a prescribed accuracy (called the sample complexity): e.g., [Balcan, Broder, and
Zhang,  2007a,  Balcan,  Beygelzimer,  and  Langford,  2009,  Balcan,  Hanneke,  and  Vaughan,  2010, 
Beygelzimer,  Dasgupta,  and  Langford,  2009,  Castro  and  Nowak,  2008,  Dasgupta,  2004,  2005, 
Dasgupta,  Hsu,  and  Monteleoni,  2007b,  Dasgupta,  Kalai,  and  Monteleoni,  2009,  Freund,  Seung, 
Shamir, and Tishby, 1997, Friedman, 2009, Hanneke, 2007a,b, 2009, 2011, Kääriäinen, 2006,
Koltchinskii,  2010,  Nowak,  2008,  Wang,  2009].  In  particular,  [Balcan,  Hanneke,  and  Vaughan, 
2010]  show  that  in  noise-free  binary  classifier  learning,  for  any  passive  learning  algorithm  for  a 
concept space of finite VC dimension, there exists an active learning algorithm with asymptotically much smaller sample complexity for any nontrivial target concept. In later work, [Hanneke,
2009]  strengthens  this  result  by  removing  a  certain  strong  dependence  on  the  distribution  of  the 
data  in  the  learning  algorithm.  Thus,  it  appears  there  are  profound  advantages  to  active  learning 
compared  to  passive  learning. 

However,  the  ability  to  rapidly  converge  to  a  good  classifier  using  only  a  small  number  of 
labels  is  only  one  desirable  quality  of  a  machine  learning  method,  and  there  are  other  qualities 
that  may  also  be  important  in  certain  scenarios.  In  particular,  the  ability  to  verify  the  performance 
of a learning method is often a crucial part of machine learning applications, as (among other
things)  it  helps  us  determine  whether  we  have  enough  data  to  achieve  a  desired  level  of  accuracy 
with  the  given  method.  In  passive  learning,  one  common  practice  for  this  verification  is  to  hold 
out  a  random  sample  of  labeled  examples  as  a  validation  sample  to  evaluate  the  trained  classifier 
(e.g.,  to  determine  when  training  is  complete).  It  turns  out  this  technique  is  not  feasible  in  active 
learning, since in order to be really useful as an indicator of whether we have seen enough labels to guarantee the desired accuracy, the number of labeled examples in the random validation
sample  would  need  to  be  much  larger  than  the  number  of  labels  requested  by  the  active  learning 
algorithm  itself,  thus  (to  some  extent)  canceling  the  savings  obtained  by  performing  active  rather 
than passive learning. Another common practice in passive learning is to examine the training error rate of the returned classifier, which can serve as a reasonable indicator of performance (after
adjusting  for  model  complexity).  However,  again  this  measure  of  performance  is  not  necessarily 
reasonable  for  active  learning,  since  the  set  of  examples  the  algorithm  requests  the  labels  of  is 
typically  distributed  very  differently  from  the  test  examples  the  classifier  will  be  applied  to  after 
training. 

This  reasoning  indicates  that  performance  verification  is  (at  best)  a  far  more  subtle  issue  in 
active  learning  than  in  passive  learning.  Indeed,  [Balcan,  Hanneke,  and  Vaughan,  2010]  note  that 
although the number of labels required to achieve good accuracy is significantly smaller than in passive learning, it is often the case that the number of labels required to verify that the accuracy
is  good  is  not  significantly  improved.  In  particular,  this  phenomenon  can  dramatically  increase 
the  sample  complexity  of  active  learning  algorithms  that  adaptively  determine  how  many  labels 
to  request  before  terminating.  In  short,  if  we  require  the  algorithm  both  to  learn  an  accurate 
concept  and  to  know  that  its  concept  is  accurate,  then  the  number  of  labels  required  by  active 
learning  is  often  not  significantly  smaller  than  the  number  required  by  passive  learning. 

We  should  note,  however,  that  the  above  results  were  proven  for  a  learning  scenario  in  which 
the  target  concept  is  considered  a  constant,  and  no  information  about  the  process  that  generates 
this concept is known a priori. Alternatively, we can consider a modification of this problem, so
that  the  target  concept  can  be  thought  of  as  a  random  variable,  a  sample  from  a  known  distribution 
(called  a  prior)  over  the  space  of  possible  concepts.  Such  a  setting  has  been  studied  in  detail 
in  the  context  of  passive  learning  for  noise-free  binary  classification.  In  particular,  [Haussler, 
Kearns, and Schapire, 1994a] found that for any concept space of finite VC dimension $d$, for any prior and distribution over data points, $O(d/\varepsilon)$ random labeled examples are sufficient for the expected error rate of the Bayes classifier produced under the posterior distribution to be at most $\varepsilon$. Furthermore, it is easy to construct learning problems for which there is an $\Omega(1/\varepsilon)$ lower bound on the number of random labeled examples required to achieve expected error rate at most $\varepsilon$, by any passive learning algorithm; for instance, the problem of learning threshold classifiers on $[0, 1]$ under a uniform data distribution and uniform prior is one such scenario.

In  the  context  of  active  learning  (again,  with  access  to  the  prior),  [Freund,  Seung,  Shamir,  and 
Tishby,  1997]  analyze  the  Query  by  Committee  algorithm,  and  find  that  if  a  certain  information 
gain  quantity  for  the  points  requested  by  the  algorithm  is  lower-bounded  by  a  value  g,  then  the 
algorithm requires only $O((d/g) \log(1/\varepsilon))$ labels to achieve expected error rate at most $\varepsilon$. In particular, they show that this is satisfied for constant $g$ for linear separators under a near-uniform prior, and a near-uniform data distribution over the unit sphere. This represents a marked improvement over the results of [Haussler, Kearns, and Schapire, 1994a] for passive learning, and since the Query by Committee algorithm is self-verifying, this result is highly relevant to the
present  discussion.  However,  the  condition  that  the  information  gains  be  lower-bounded  by  a 
constant is quite restrictive, and many interesting learning problems are precluded by this requirement. Furthermore, there exist learning problems (with finite VC dimension) for which the Query by Committee algorithm makes an expected number of label requests exceeding $\Omega(1/\varepsilon)$. To date, there has not been a general analysis of how the value of $g$ can behave as a function of $\varepsilon$, though such an analysis would likely be quite interesting.

In  the  present  paper,  we  take  a  more  general  approach  to  the  question  of  active  learning  with 
access to the prior. We are interested in the broad question of whether access to the prior bridges
the  gap  between  the  sample  complexity  of  learning  and  the  sample  complexity  of  learning  with 
verification.  Specifically,  we  ask  the  following  question. 

Can a prior-dependent self-terminating active learning algorithm for a concept class of finite VC dimension always achieve expected error rate at most $\varepsilon$ using $o(1/\varepsilon)$ label requests?

After  some  basic  definitions  in  Section  6.2,  we  begin  in  Section  6.4  with  a  concrete  example, 
namely  interval  classifiers  under  a  uniform  data  density  but  arbitrary  prior,  to  illustrate  the  general 
idea,  and  convey  some  of  the  intuition  as  to  why  one  might  expect  a  positive  answer  to  this 
question.  In  Section  6.5,  we  present  a  general  proof  that  the  answer  is  always  “yes.”  As  the 
known results for the sample complexity of passive learning with access to the prior are typically $\propto 1/\varepsilon$ [Haussler, Kearns, and Schapire, 1994a], and this is sometimes tight, this represents
an  improvement  over  passive  learning.  The  proof  is  simple  and  accessible,  yet  represents  an 
important  step  in  understanding  the  problem  of  self-termination  in  active  learning  algorithms,  and 
the  general  issue  of  the  complexity  of  verification.  Also,  as  this  is  a  result  that  does  not  generally 
hold  for  prior-independent  algorithms  (even  for  their  “average-case”  behavior  induced  by  the 
prior)  for  certain  concept  spaces,  this  also  represents  a  significant  step  toward  understanding  the 
inherent  value  of  having  access  to  the  prior. 

6.2  Definitions  and  Preliminaries 

First, we introduce some notation and formal definitions. We denote by $\mathcal{X}$ the instance space, representing the range of the unlabeled data points, and we suppose a distribution $\mathcal{D}$ on $\mathcal{X}$, which we will refer to as the data distribution. We also suppose the existence of a sequence $X_1, X_2, \ldots$ of i.i.d. random variables, each with distribution $\mathcal{D}$, referred to as the unlabeled data sequence. Though one could potentially analyze the achievable performance as a function of the number of unlabeled points made available to the learning algorithm (cf. [Dasgupta, 2005]), for simplicity in the present work, we will suppose this unlabeled sequence is essentially inexhaustible, corresponding to the practical fact that unlabeled data are typically available in abundance as they are often relatively inexpensive to obtain. Additionally, there is a set $\mathbb{C}$ of measurable classifiers $h : \mathcal{X} \to \{-1, +1\}$, referred to as the concept space. We denote by $d$ the VC dimension of $\mathbb{C}$, and in our present context we will restrict ourselves to spaces $\mathbb{C}$ with $d < \infty$, referred to as a VC class. We also have a probability distribution $\pi$, called the prior, over $\mathbb{C}$, and a random variable $h^* \sim \pi$, called the target function; we suppose $h^*$ is independent from the data sequence $X_1, X_2, \ldots$. We adopt the usual notation for conditional expectations and probabilities [Ash and Doleans-Dade, 2000]; for instance, $\mathbb{E}[A \mid B]$ can be thought of as an expectation of the value $A$, under the conditional distribution of $A$ given the value of $B$ (which itself is random), and thus the value of $\mathbb{E}[A \mid B]$ is essentially determined by the value of $B$. For any measurable $h : \mathcal{X} \to \{-1, +1\}$, define the error rate $\mathrm{er}(h) = \mathcal{D}(\{x : h(x) \neq h^*(x)\})$. So far, this setup is essentially identical to that of [Freund, Seung, Shamir, and Tishby, 1997, Haussler, Kearns, and Schapire, 1994a].

The protocol in active learning is the following. An active learning algorithm $\mathcal{A}$ is given as input the prior $\pi$, the data distribution $\mathcal{D}$ (though see Section 6.6), and a value $\varepsilon \in (0, 1]$. It also (implicitly) depends on the data sequence $X_1, X_2, \ldots$, and has an indirect dependence on the target function $h^*$ via the following type of interaction. The algorithm may inspect the values $X_i$ for any initial segment of the data sequence, and select an index $i \in \mathbb{N}$ to “request” the label of; after selecting such an index, the algorithm receives the value $h^*(X_i)$. The algorithm may then select another index, request the label, receive the value of $h^*$ on that point, etc. This happens for a number of rounds, $N(\mathcal{A}, h^*, \varepsilon, \mathcal{D}, \pi)$, before eventually the algorithm halts and returns a classifier $\hat{h}$. An algorithm is said to be correct if $\mathbb{E}[\mathrm{er}(\hat{h})] \leq \varepsilon$ for every $(\varepsilon, \mathcal{D}, \pi)$; that is, given direct access to the prior and the data distribution, and given a specified value $\varepsilon$, a correct algorithm must be guaranteed to have expected error rate at most $\varepsilon$. Define the expected sample complexity of $\mathcal{A}$ for $(\mathcal{X}, \mathbb{C}, \mathcal{D}, \pi)$ to be the function $SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi) = \mathbb{E}[N(\mathcal{A}, h^*, \varepsilon, \mathcal{D}, \pi)]$: the expected number of label requests the algorithm makes.
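The interaction protocol can be sketched as a small harness that counts label requests. This is a minimal illustration under stated assumptions, not an algorithm from the text: `LabelOracle` and `learn_threshold` are hypothetical names, the pool is a finite sorted sample standing in for the data sequence, and the learner shown is plain binary search for a threshold classifier (a simple class chosen only to keep the example short). The counter `num_requests` plays the role of the number of rounds $N$.

```python
import random


class LabelOracle:
    """Answers label requests for pool points and counts how many were made."""

    def __init__(self, target, pool):
        self._target, self._pool = target, pool
        self.num_requests = 0

    def label(self, i):
        self.num_requests += 1
        return self._target(self._pool[i])


def learn_threshold(pool, oracle):
    """Binary search over a sorted pool for the leftmost point labeled +1."""
    lo, hi = 0, len(pool)      # invariant: pool[:lo] are -1, pool[hi:] are +1
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle.label(mid) == +1:
            hi = mid
        else:
            lo = mid + 1
    cut = pool[lo] if lo < len(pool) else float("inf")
    return lambda x: +1 if x >= cut else -1


random.seed(0)
pool = sorted(random.random() for _ in range(1000))   # unlabeled pool
target = lambda x: +1 if x >= 0.37 else -1            # hypothetical target
oracle = LabelOracle(target, pool)
h_hat = learn_threshold(pool, oracle)
```

Here the learner only ever sees the labels it asks for, and the returned classifier agrees with the target on the whole pool after roughly $\log_2$ of the pool size many requests.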
6.3 Prior-Independent Learning Algorithms


One may initially wonder whether we could achieve this $o(1/\varepsilon)$ result merely by calculating the expected sample complexity of some prior-independent method, thus precluding the need for novel algorithms. Formally, we say an algorithm $\mathcal{A}$ is prior-independent if the conditional distribution of the queries and return value of $\mathcal{A}(\varepsilon, \mathcal{D}, \pi)$ given $\{(X_1, h^*(X_1)), (X_2, h^*(X_2)), \ldots\}$ is functionally independent of $\pi$. Indeed, for some $\mathbb{C}$ and $\mathcal{D}$, it is known that there are prior-independent active learning algorithms $\mathcal{A}$ that have $\mathbb{E}[N(\mathcal{A}, h^*, \varepsilon, \mathcal{D}, \pi) \mid h^*] = o(1/\varepsilon)$ (always); for instance, threshold classifiers have this property under any $\mathcal{D}$, homogeneous linear separators have this property under a uniform $\mathcal{D}$ on the unit sphere in $k$ dimensions, and intervals with positive width on $\mathcal{X} = [0, 1]$ have this property under $\mathcal{D} = \mathrm{Uniform}([0, 1])$ (see e.g., [Dasgupta, 2005]). It is straightforward to show that any such $\mathcal{A}$ will also have $SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi) = o(1/\varepsilon)$ for every $\pi$. In particular, the law of total expectation and the dominated convergence theorem imply

\[
\lim_{\varepsilon \to 0} \varepsilon\, SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi)
= \lim_{\varepsilon \to 0} \varepsilon\, \mathbb{E}\big[\mathbb{E}[N(\mathcal{A}, h^*, \varepsilon, \mathcal{D}, \pi) \mid h^*]\big]
= \mathbb{E}\Big[\lim_{\varepsilon \to 0} \varepsilon\, \mathbb{E}[N(\mathcal{A}, h^*, \varepsilon, \mathcal{D}, \pi) \mid h^*]\Big]
= 0.
\]


In these cases, we can think of $SC$ as a kind of average-case analysis of these algorithms. However, as we discuss next, there are also many $\mathbb{C}$ and $\mathcal{D}$ for which there is no prior-independent algorithm achieving $o(1/\varepsilon)$ sample complexity for all priors. Thus, any general result on $o(1/\varepsilon)$ expected sample complexity for $\pi$-dependent algorithms would indicate that there is a real advantage to having access to the prior, beyond the apparent smoothing effects of an average-case analysis.

As an example of a problem where no prior-independent self-verifying algorithm can achieve $o(1/\varepsilon)$ sample complexity, consider $\mathcal{X} = [0, 1]$, $\mathcal{D} = \mathrm{Uniform}([0, 1])$, and $\mathbb{C}$ as the concept space of interval classifiers: $\mathbb{C} = \{\mathbb{1}^{\pm}_{(a,b)} : 0 \leq a \leq b \leq 1\}$, where $\mathbb{1}^{\pm}_{(a,b)}(x) = +1$ if $x \in (a, b)$ and $-1$ otherwise. Note that because we allow $a = b$, there is a classifier $h_- \in \mathbb{C}$ labeling all of $\mathcal{X}$ negative. For $0 \leq a \leq b \leq 1$, let $\pi_{(a,b)}$ denote the prior with $\pi_{(a,b)}(\{\mathbb{1}^{\pm}_{(a,b)}\}) = 1$. We now show any correct prior-independent algorithm has $\Omega(1/\varepsilon)$ sample complexity for $\pi_{(0,0)}$, following a technique of [Balcan, Hanneke, and Vaughan, 2010]. Consider any $\varepsilon \in (0, 1/144)$ and any prior-independent active learning algorithm $\mathcal{A}$ with $SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi_{(0,0)}) < s = \frac{1}{144\varepsilon}$. Then define $H_\varepsilon = \left\{(12 i \varepsilon, 12 (i+1) \varepsilon) : i \in \left\{0, 1, \ldots, \left\lfloor \frac{1 - 12\varepsilon}{12\varepsilon} \right\rfloor\right\}\right\}$. Let $\hat{h}_{(a,b)}$ denote the classifier returned by $\mathcal{A}(\varepsilon, \mathcal{D}, \cdot)$ when queries are answered with $h^* = \mathbb{1}^{\pm}_{(a,b)}$, for $0 \leq a \leq b \leq 1$, and let $R_{(a,b)}$ denote the set of examples $(x, y)$ for which $\mathcal{A}(\varepsilon, \mathcal{D}, \cdot)$ requests labels (including their $y = h^*(x)$ labels). The point of this construction is that, with such a small number of queries, for many of the $(a, b) \in H_\varepsilon$, the algorithm must behave identically for $h^* = \mathbb{1}^{\pm}_{(a,b)}$ as for $h^* = \mathbb{1}^{\pm}_{(0,0)}$ (i.e., $R_{(a,b)} = R_{(0,0)}$, and hence $\hat{h}_{(a,b)} = \hat{h}_{(0,0)}$). These $\pi_{(a,b)}$ priors will then witness the fact that $\mathcal{A}$ is not a correct self-verifying algorithm. Formally,


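\[
\begin{aligned}
\max_{(a,b) \in H_\varepsilon} \mathbb{E}\big[\mathcal{D}(x : \hat{h}_{(a,b)}(x) \neq \mathbb{1}^{\pm}_{(a,b)}(x))\big]
&\geq \frac{1}{|H_\varepsilon|} \sum_{(a,b) \in H_\varepsilon} \mathbb{E}\big[\mathcal{D}(x : \hat{h}_{(a,b)}(x) \neq \mathbb{1}^{\pm}_{(a,b)}(x))\big] \\
&= \frac{1}{|H_\varepsilon|}\, \mathbb{E}\Bigg[\sum_{(a,b) \in H_\varepsilon} \mathcal{D}(x : \hat{h}_{(a,b)}(x) \neq \mathbb{1}^{\pm}_{(a,b)}(x))\Bigg] \\
&\geq \frac{1}{|H_\varepsilon|}\, \mathbb{E}\Bigg[\sum_{(a,b) \in H_\varepsilon :\, R_{(a,b)} = R_{(0,0)}} \Big(12\varepsilon - \min\big\{\mathcal{D}(x : \hat{h}_{(a,b)}(x) \neq -1),\, 12\varepsilon\big\}\Big)\Bigg].
\end{aligned} \tag{6.1}
\]

Since the summation in (6.1) is restricted to $(a, b)$ with $R_{(a,b)} = R_{(0,0)}$, these $(a, b)$ must also have $\hat{h}_{(a,b)} = \hat{h}_{(0,0)}$, so that (6.1) equals

\[
\frac{1}{|H_\varepsilon|}\, \mathbb{E}\Bigg[\sum_{(a,b) \in H_\varepsilon :\, R_{(a,b)} = R_{(0,0)}} \Big(12\varepsilon - \min\big\{\mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1),\, 12\varepsilon\big\}\Big)\Bigg]. \tag{6.2}
\]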


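Furthermore, for a given $X_1, X_2, \ldots$ sequence, the only $(a, b) \in H_\varepsilon$ with $R_{(a,b)} \neq R_{(0,0)}$ are those for which some $(x, -1) \in R_{(0,0)}$ has $x \in (a, b)$; since the $(a, b) \in H_\varepsilon$ are disjoint, the above summation has at least $|H_\varepsilon| - |R_{(0,0)}|$ elements in it. Thus, (6.2) is at least

\[
\begin{aligned}
&\mathbb{E}\Bigg[\frac{|H_\varepsilon| - |R_{(0,0)}|}{|H_\varepsilon|} \Big(12\varepsilon - \min\big\{\mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1),\, 12\varepsilon\big\}\Big)\Bigg] \\
&\quad\geq \mathbb{E}\Bigg[\mathbb{1}\big[|R_{(0,0)}| \leq 3s\big]\, \mathbb{1}\big[\mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1) \leq 6\varepsilon\big]\, \frac{|H_\varepsilon| - 3s}{|H_\varepsilon|}\, (12\varepsilon - 6\varepsilon)\Bigg] \\
&\quad\geq 3\varepsilon\, \mathbb{P}\big(|R_{(0,0)}| \leq 3s,\ \mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1) \leq 6\varepsilon\big).
\end{aligned} \tag{6.3}
\]

By Markov's inequality,

\[
\mathbb{P}\big(|R_{(0,0)}| > 3s\big) \leq \mathbb{E}\big[|R_{(0,0)}|\big] / (3s) = SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi_{(0,0)}) / (3s) < 1/3,
\]

and $\mathbb{P}\big(\mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1) > 6\varepsilon\big) \leq \mathbb{E}\big[\mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1)\big] / (6\varepsilon)$, and if $\mathcal{A}$ is a correct self-verifying algorithm, then $\mathbb{E}\big[\mathcal{D}(x : \hat{h}_{(0,0)}(x) \neq -1)\big] / (6\varepsilon) \leq 1/6$. Thus, by a union bound, (6.3) is at least $3\varepsilon(1 - 1/3 - 1/6) = (3/2)\varepsilon > \varepsilon$. Therefore, $\mathcal{A}$ cannot be a correct self-verifying learning algorithm.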
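The counting step at the heart of this argument can be checked numerically with a small sketch. It is an illustration only, under stated assumptions: the helper names `disjoint_intervals` and `unaffected` are hypothetical, the value of $\varepsilon$ and the particular query points are arbitrary choices, and exact fractions are used so that the floor in the definition of the interval family is computed without rounding error. The point shown is that, because the intervals are disjoint, each query point can rule out at most one of them.

```python
import math
from fractions import Fraction


def disjoint_intervals(eps):
    """Consecutive open intervals of width 12*eps packed into [0, 1]."""
    last = math.floor((1 - 12 * eps) / (12 * eps))
    return [(12 * i * eps, 12 * (i + 1) * eps) for i in range(last + 1)]


def unaffected(intervals, queries):
    """Intervals containing no query point: for these targets, every query
    gets the same (negative) label as it would for the empty target."""
    return [(a, b) for (a, b) in intervals if not any(a < x < b for x in queries)]


eps = Fraction(1, 1440)
H = disjoint_intervals(eps)                          # 120 intervals
queries = [Fraction(k, 31) for k in range(1, 31)]    # any 30 query points
left = unaffected(H, queries)
```

With few queries relative to the number of intervals, most interval targets are indistinguishable from the all-negative target, which is what forces a large error on at least one of them.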


6.4  Prior-Dependent  Learning:  An  Example 

We begin our exploration of $\pi$-dependent active learning with a concrete example, namely interval classifiers under a uniform data density but arbitrary prior, to illustrate how access to the prior can make a difference in the sample complexity. Specifically, consider $\mathcal{X} = [0, 1]$, $\mathcal{D}$ uniform on $[0, 1]$, and the concept space $\mathbb{C}$ of interval classifiers specified in the previous section. For each classifier $h \in \mathbb{C}$, define $w(h) = \mathcal{D}(x : h(x) = +1)$ (the width of the interval $h$). Note that because we allow $a = b$ in the definition of $\mathbb{C}$, there is a classifier $h_- \in \mathbb{C}$ with $w(h_-) = 0$.

For simplicity, in this example (only) we will suppose the algorithm may request the label of any point in $\mathcal{X}$, not just those in the sequence $\{X_i\}$; the same ideas can easily be adapted to the setting where queries are restricted to $\{X_i\}$. Consider an active learning algorithm that sequentially requests the labels $h^*(x)$ for points $x$ at $1/2$, $1/4$, $3/4$, $1/8$, $3/8$, $5/8$, $7/8$, $1/16$, $3/16$, etc., until (case 1) it encounters an example $x$ with $h^*(x) = +1$ or until (case 2) the set of classifiers $V \subseteq \mathbb{C}$ consistent with all observed labels so far satisfies $\mathbb{E}[w(h^*) \mid V] \leq \varepsilon$ (whichever comes first). In case 2, the algorithm simply halts and returns the constant classifier that always predicts $-1$: call it $h_-$; note that $\mathrm{er}(h_-) = w(h^*)$. In case 1, the algorithm enters a second phase, in which it performs a binary search (repeatedly querying the midpoint between the closest two $-1$ and $+1$ points, taking 0 and 1 as known negative points) to the left and right of the observed positive point, halting after $\log_2(4/\varepsilon)$ label requests on each side; this results in estimates of the target's endpoints up to $\pm\varepsilon/4$, so that returning any classifier among the set $V \subseteq \mathbb{C}$ consistent with these labels results in error rate at most $\varepsilon$; in particular, if $\hat{h}$ is the classifier in $V$ returned, then $\mathbb{E}[\mathrm{er}(\hat{h}) \mid V] \leq \varepsilon$.
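The two-phase procedure just described can be sketched in code. This is a minimal sketch under assumptions that go beyond the text: the prior is taken to be a finite dictionary mapping intervals `(a, b)` to probabilities (so the posterior expected width can be computed by filtering), the target is assumed to lie in the support of that prior, the number of binary-search steps is rounded up to an integer, and the names `dyadic_points`, `label`, and `interval_learner` are hypothetical.

```python
import math


def dyadic_points():
    """Yield 1/2, 1/4, 3/4, 1/8, 3/8, ... (the phase-one query order)."""
    k = 1
    while True:
        for j in range(1, 2 ** k, 2):
            yield j / 2 ** k
        k += 1


def label(h, x):
    """Interval classifier h = (a, b): +1 on the open interval, -1 elsewhere."""
    a, b = h
    return +1 if a < x < b else -1


def interval_learner(prior, target, eps):
    """Return (hypothesis, number of label requests) for a finite prior."""
    consistent = dict(prior)        # version space V, with prior weights
    negatives = [0.0, 1.0]          # 0 and 1 act as known negative points
    n_requests = 0

    def query(x):
        nonlocal consistent, n_requests
        n_requests += 1
        y = label(target, x)
        consistent = {h: p for h, p in consistent.items() if label(h, x) == y}
        return y

    def expected_width():
        mass = sum(consistent.values())
        return sum(p * (b - a) for (a, b), p in consistent.items()) / mass

    for x in dyadic_points():
        if expected_width() <= eps:     # case 2: halt, predict all-negative
            return (0.0, 0.0), n_requests
        if query(x) == +1:              # case 1: found a positive point
            break
        negatives.append(x)

    steps = math.ceil(math.log2(4 / eps))
    lo = max(v for v in negatives if v < x)   # closest negative on the left
    hi = min(v for v in negatives if v > x)   # closest negative on the right
    left_pos = right_pos = x
    for _ in range(steps):                    # binary search, left endpoint
        mid = (lo + left_pos) / 2
        if query(mid) == +1:
            left_pos = mid
        else:
            lo = mid
    for _ in range(steps):                    # binary search, right endpoint
        mid = (right_pos + hi) / 2
        if query(mid) == +1:
            right_pos = mid
        else:
            hi = mid
    return next(iter(consistent)), n_requests  # any consistent hypothesis


prior = {(0.0, 0.0): 0.5, (0.3, 0.6): 0.25, (0.31, 0.6): 0.25}
```

With this toy prior, an all-negative target is identified after a single request (the first query removes both positive-width intervals, so the posterior expected width drops to zero), while a positive-width target costs one phase-one request plus the two binary searches.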

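Denoting this algorithm by $\mathcal{A}_{[]}$, and $\hat{h}$ the classifier it returns, we have

\[
\mathbb{E}\big[\mathrm{er}(\hat{h})\big] = \mathbb{E}\Big[\mathbb{E}\big[\mathrm{er}(\hat{h}) \,\big|\, V\big]\Big] \leq \varepsilon,
\]

so that the algorithm is definitely correct.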

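Note that case 2 will definitely be satisfied after at most $\frac{2}{\varepsilon}$ label requests, and if $w(h^*) > \varepsilon$, then case 1 will definitely be satisfied after at most $\frac{2}{w(h^*)}$ label requests, so that the algorithm never makes more than $\frac{2}{\max\{w(h^*),\, \varepsilon\}}$ label requests before satisfying one of the two cases. Abbreviating $N(h^*) = N(\mathcal{A}_{[]}, h^*, \varepsilon, \mathcal{D}, \pi)$, we have

\[
\begin{aligned}
\mathbb{E}[N(h^*)]
&= \mathbb{E}\big[N(h^*) \,\big|\, w(h^*) = 0\big]\, \mathbb{P}(w(h^*) = 0) \\
&\quad + \mathbb{E}\big[N(h^*) \,\big|\, 0 < w(h^*) \leq \sqrt{\varepsilon}\big]\, \mathbb{P}(0 < w(h^*) \leq \sqrt{\varepsilon}) \\
&\quad + \mathbb{E}\big[N(h^*) \,\big|\, w(h^*) > \sqrt{\varepsilon}\big]\, \mathbb{P}(w(h^*) > \sqrt{\varepsilon}) \\
&\leq \mathbb{E}\big[N(h^*) \,\big|\, w(h^*) = 0\big]\, \mathbb{P}(w(h^*) = 0) + \frac{2}{\varepsilon}\, \mathbb{P}(0 < w(h^*) \leq \sqrt{\varepsilon}) + \frac{2}{\sqrt{\varepsilon}} + 2 \log_2 \frac{4}{\varepsilon}.
\end{aligned} \tag{6.4}
\]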

The third and fourth terms in (6.4) are $o(1/\varepsilon)$. Since $\mathbb{P}(0 < w(h^*) \leq \sqrt{\varepsilon}) \to 0$ as $\varepsilon \to 0$, the second term in (6.4) is $o(1/\varepsilon)$ as well. If $\mathbb{P}(w(h^*) = 0) = 0$, this completes the proof. We focus the rest of the proof on the first term in (6.4), in the case that $\mathbb{P}(w(h^*) = 0) > 0$: i.e., there is nonzero probability that the target $h^*$ labels the space all negative. Letting $V$ denote the subset of $\mathbb{C}$ consistent with all requested labels, note that on the event $w(h^*) = 0$, after $n$ label requests (for $n + 1$ a power of 2) we have $\max_{h \in V} w(h) \leq 1/n$. Thus, for any value $\gamma \in (0, 1)$, after at most $\frac{2}{\gamma}$ label requests, on the event that $w(h^*) = 0$,

\[
\mathbb{E}\big[w(h^*) \,\big|\, V\big]
\leq \frac{\mathbb{E}\big[w(h^*)\, \mathbb{1}[w(h^*) \leq \gamma]\big]}{\pi(V)}
\leq \frac{\mathbb{E}\big[w(h^*)\, \mathbb{1}[w(h^*) \leq \gamma]\big]}{\mathbb{P}(w(h^*) = 0)}. \tag{6.5}
\]


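Now note that, by the dominated convergence theorem,

\[
\lim_{\gamma \to 0} \mathbb{E}\left[\frac{w(h^*)\, \mathbb{1}[w(h^*) \leq \gamma]}{\gamma}\right]
= \mathbb{E}\left[\lim_{\gamma \to 0} \frac{w(h^*)\, \mathbb{1}[w(h^*) \leq \gamma]}{\gamma}\right]
= 0.
\]

Therefore, $\mathbb{E}\big[w(h^*)\, \mathbb{1}[w(h^*) \leq \gamma]\big] = o(\gamma)$. If we define $\gamma_\varepsilon$ as the largest value of $\gamma$ for which $\mathbb{E}\big[w(h^*)\, \mathbb{1}[w(h^*) \leq \gamma]\big] \leq \varepsilon\, \mathbb{P}(w(h^*) = 0)$ (or, say, half the supremum if the maximum is not achieved), then we have $\gamma_\varepsilon \gg \varepsilon$. Combined with (6.5), this implies

\[
\mathbb{E}\big[N(h^*) \,\big|\, w(h^*) = 0\big] \leq \frac{2}{\gamma_\varepsilon} = o(1/\varepsilon).
\]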


Thus, all of the terms in (6.4) are $o(1/\varepsilon)$, so that in total $\mathbb{E}[N(h^*)] = o(1/\varepsilon)$.

In conclusion, for this concept space $\mathbb{C}$ and data distribution $\mathcal{D}$, we have a correct active learning algorithm $\mathcal{A}$ achieving a sample complexity $SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi) = o(1/\varepsilon)$ for all priors $\pi$ on $\mathbb{C}$.


6.5  A General Result for Self-Verifying Bayesian Active Learning

In this section, we present our main result for improvements achievable by prior-dependent
self-verifying active learning: a general result stating that $o(1/\varepsilon)$ expected sample complexity
is always achievable for some appropriate prior-dependent active learning algorithm, for any
$(\mathcal{X}, C, \mathcal{D}, \pi)$ for which $C$ has finite VC dimension. Since the known results for the sample
complexity of passive learning with access to the prior are typically $\Theta(1/\varepsilon)$ [Haussler, Kearns, and
Schapire, 1994a], and since there are known learning problems $(\mathcal{X}, C, \mathcal{D}, \pi)$ for which every
passive learning algorithm requires $\Omega(1/\varepsilon)$ samples, this $o(1/\varepsilon)$ result for active learning represents
an improvement over passive learning.

The proof is simple and accessible, yet represents an important step in understanding the
problem of self-termination in active learning algorithms, and the general issue of the complexity
of verification. Also, since there are problems $(\mathcal{X}, C, \mathcal{D})$ where $C$ has finite VC dimension but
for which no prior-independent correct active learning algorithm (of the self-terminating type
studied here) can achieve $o(1/\varepsilon)$ expected sample complexity for every $\pi$, this also represents a
significant step toward understanding the inherent value of having access to the prior in active
learning.

First,  we  have  a  small  lemma. 

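Lemma 6.1. For any sequence of functions $\phi_n : C \to [0, \infty)$ such that, $\forall f \in C$, $\phi_n(f) = o(1/n)$
and $\forall n \in \mathbb{N}$, $\phi_n(f) \le c/n$ (for an $f$-independent constant $c \in (0, \infty)$), there exists a sequence
$\bar{\phi}_n$ in $[0, \infty)$ such that

$$
\bar{\phi}_n = o(1/n) \quad \text{and} \quad \lim_{n \to \infty} \mathbb{P}\big(\phi_n(h^*) > \bar{\phi}_n\big) = 0.
$$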

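Proof. For any constant $\gamma \in (0, \infty)$, we have (by Markov's inequality and the dominated
convergence theorem)

$$
\lim_{n \to \infty} \mathbb{P}\big(n\,\phi_n(h^*) > \gamma\big)
\le \frac{1}{\gamma} \lim_{n \to \infty} \mathbb{E}\big[n\,\phi_n(h^*)\big]
= \frac{1}{\gamma}\, \mathbb{E}\Big[\lim_{n \to \infty} n\,\phi_n(h^*)\Big]
= 0.
$$

Therefore (by induction), there exists a diverging sequence $n_i$ in $\mathbb{N}$ such that

$$
\lim_{i \to \infty} \sup_{n \ge n_i} \mathbb{P}\big(n\,\phi_n(h^*) > 2^{-i}\big) = 0.
$$

Inverting this, let $i_n = \max\{i \in \mathbb{N} : n_i \le n\}$, and define $\bar{\phi}_n = (1/n) \cdot 2^{-i_n}$. By construction,
$\mathbb{P}\big(\phi_n(h^*) > \bar{\phi}_n\big) \to 0$. Furthermore, $n_i \to \infty \implies i_n \to \infty$, so that we have

$$
\lim_{n \to \infty} n\,\bar{\phi}_n = \lim_{n \to \infty} 2^{-i_n} = 0,
$$

implying $\bar{\phi}_n = o(1/n)$. □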


Theorem 6.2. For any VC class $C$, there is a correct active learning algorithm $A_a$ that, for every
data distribution $\mathcal{D}$ and prior $\pi$, achieves expected sample complexity

$$
SC(A_a, \varepsilon, \mathcal{D}, \pi) = o(1/\varepsilon).
$$

Our approach to proving Theorem 6.2 is via a reduction to established results about
(prior-independent) active learning algorithms that are not self-verifying. Specifically, consider a
slightly different type of active learning algorithm than that defined above: namely, an algorithm
$A_b$ that takes as input a budget $n \in \mathbb{N}$ on the number of label requests it is allowed to
make, and that after making at most $n$ label requests returns as output a classifier $\hat{h}_n$. Let us refer
to any such algorithm as a budget-based active learning algorithm. Note that budget-based active
learning algorithms are prior-independent (have no direct access to the prior). The following
result was proven by [Hanneke, 2009] (see also the related earlier work of [Balcan, Hanneke, and
Vaughan, 2010]).

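Lemma 6.3. [Hanneke, 2009] For any VC class $C$, there exists a constant $c \in (0, \infty)$, a function
$\mathcal{E}(n; f, \mathcal{D})$, and a budget-based active learning algorithm $A_b$, such that

$$
\forall \mathcal{D},\ \forall f \in C, \quad \mathcal{E}(n; f, \mathcal{D}) \le c/n \quad \text{and} \quad \mathcal{E}(n; f, \mathcal{D}) = o(1/n),
$$

and $\mathbb{E}\big[\mathrm{er}(A_b(n)) \mid h^*\big] \le \mathcal{E}(n; h^*, \mathcal{D})$ (always).²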

That is, equivalently, for any fixed value for the target function, the expected error rate is
$o(1/n)$, where the random variable in the expectation is only the data sequence $X_1, X_2, \ldots$. Our
task in the proof of Theorem 6.2 is to convert such a budget-based algorithm into one that is
correct, self-terminating, and prior-dependent, taking $\varepsilon$ as input.

Proof of Theorem 6.2. Consider $A_b$, $\mathcal{E}$, and $c$ as in Lemma 6.3, let $\hat{h}_n$ denote the classifier returned by
$A_b(n)$, and define

$$
n_{\pi,\varepsilon} = \min\big\{ n \in \mathbb{N} : \mathbb{E}[\mathrm{er}(\hat{h}_n)] \le \varepsilon \big\}.
$$

This value is accessible based purely on access to $\pi$ and $\mathcal{D}$. Furthermore, we clearly have (by
construction) $\mathbb{E}[\mathrm{er}(\hat{h}_{n_{\pi,\varepsilon}})] \le \varepsilon$. Thus, letting $A_a$ denote the active learning algorithm taking
$(\mathcal{D}, \pi, \varepsilon)$ as input, which runs $A_b(n_{\pi,\varepsilon})$ and then returns $\hat{h}_{n_{\pi,\varepsilon}}$, we have that $A_a$ is a correct
learning algorithm (i.e., its expected error rate is at most $\varepsilon$).

² Furthermore, it is not difficult to see that we can take this $\mathcal{E}$ to be measurable in the $h^*$ argument.

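As for the expected sample complexity $SC(A_a, \varepsilon, \mathcal{D}, \pi)$ achieved by $A_a$, we have
$SC(A_a, \varepsilon, \mathcal{D}, \pi) \le n_{\pi,\varepsilon}$, so that it remains only to bound $n_{\pi,\varepsilon}$. By Lemma 6.1, there is a
$\pi$-dependent function $\bar{\mathcal{E}}(n; \pi, \mathcal{D})$ such that

$$
\pi\big(\{ f \in C : \mathcal{E}(n; f, \mathcal{D}) > \bar{\mathcal{E}}(n; \pi, \mathcal{D}) \}\big) \to 0
\quad \text{and} \quad \bar{\mathcal{E}}(n; \pi, \mathcal{D}) = o(1/n).
$$

Therefore, by the law of total expectation,

$$
\begin{aligned}
\mathbb{E}[\mathrm{er}(\hat{h}_n)]
&= \mathbb{E}\big[\mathbb{E}[\mathrm{er}(\hat{h}_n) \mid h^*]\big]
\le \mathbb{E}\big[\mathcal{E}(n; h^*, \mathcal{D})\big] \\
&\le \frac{c}{n}\, \pi\big(\{ f \in C : \mathcal{E}(n; f, \mathcal{D}) > \bar{\mathcal{E}}(n; \pi, \mathcal{D}) \}\big) + \bar{\mathcal{E}}(n; \pi, \mathcal{D})
= o(1/n).
\end{aligned}
$$

If $n_{\pi,\varepsilon} = O(1)$, then clearly $n_{\pi,\varepsilon} = o(1/\varepsilon)$ as needed. Otherwise, since $n_{\pi,\varepsilon}$ is monotonic in $\varepsilon$,
we must have $n_{\pi,\varepsilon} \uparrow \infty$ as $\varepsilon \downarrow 0$. In particular, in this latter case we have $n_{\pi,\varepsilon} \ge 2$ for all
sufficiently small $\varepsilon$, and minimality of $n_{\pi,\varepsilon}$ gives $\mathbb{E}[\mathrm{er}(\hat{h}_{n_{\pi,\varepsilon}-1})] > \varepsilon$, so that

$$
\varepsilon\, n_{\pi,\varepsilon}
< n_{\pi,\varepsilon}\, \mathbb{E}[\mathrm{er}(\hat{h}_{n_{\pi,\varepsilon}-1})]
\le 2\,(n_{\pi,\varepsilon} - 1)\, \mathbb{E}[\mathrm{er}(\hat{h}_{n_{\pi,\varepsilon}-1})]
\to 0 \quad \text{as } \varepsilon \to 0,
$$

using $n \le 2(n-1)$ for $n \ge 2$ and $\mathbb{E}[\mathrm{er}(\hat{h}_n)] = o(1/n)$. Thus $n_{\pi,\varepsilon} = o(1/\varepsilon)$, as required. □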


Theorem 6.2 implies that, if we have direct access to the prior distribution of $h^*$, regardless of
what that prior distribution $\pi$ is, we can always construct a self-verifying active learning algorithm
$A_a$ that has a guarantee of $\mathbb{E}[\mathrm{er}(A_a(\varepsilon, \mathcal{D}, \pi))] \le \varepsilon$ and whose expected number of label requests
is $o(1/\varepsilon)$. This guarantee is not possible for prior-independent self-verifying active learning
algorithms.
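
The budget selection used in the proof can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: `toy_expected_error` is a hypothetical $o(1/n)$ curve standing in for $\mathbb{E}[\mathrm{er}(\hat{h}_n)]$ (it is not derived from any particular $A_b$), and `smallest_budget` computes the corresponding $n_{\pi,\varepsilon}$ by a linear scan.

```python
import math
from typing import Callable


def smallest_budget(expected_error: Callable[[int], float],
                    eps: float, n_max: int = 10**6) -> int:
    """Return min{n >= 1 : expected_error(n) <= eps}, scanning upward."""
    for n in range(1, n_max + 1):
        if expected_error(n) <= eps:
            return n
    raise ValueError("no budget up to n_max meets the target error")


def toy_expected_error(n: int) -> float:
    # Hypothetical o(1/n) error curve: 1 / (n * (1 + ln n)).
    return 1.0 / (n * (1.0 + math.log(n)))


# eps * n_{pi,eps} should shrink as eps decreases, reflecting o(1/eps).
products = []
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    n = smallest_budget(toy_expected_error, eps)
    products.append(eps * n)
    print(f"eps={eps:g}  budget={n}  eps*budget={eps * n:.3f}")
```

For an error curve that is only $\Theta(1/n)$, the printed product would instead level off at a positive constant.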


6.6  Dependence on $\mathcal{D}$ in the Learning Algorithm

The dependence on $\mathcal{D}$ in the algorithm described in the proof of Theorem 6.2 is fairly weak, and
we can eliminate any direct dependence on $\mathcal{D}$ by replacing $\mathrm{er}(\hat{h}_n)$ by a $1 - \varepsilon/2$ confidence upper
bound based on $M_\varepsilon = \Omega\big(\frac{1}{\varepsilon^2} \log \frac{1}{\varepsilon}\big)$ i.i.d. unlabeled examples $X'_1, X'_2, \ldots, X'_{M_\varepsilon}$ independent from
the examples used by the algorithm (e.g., set aside in a pre-processing step, where the bound is
calculated via Hoeffding's inequality and a union bound over the values of $n$ that we check,
of which there are at most $O(1/\varepsilon)$). Then we simply increase the value of $n$ (starting at some
constant, such as 1) until

$$
\frac{1}{M_\varepsilon} \sum_{i=1}^{M_\varepsilon} \pi\big(\{ f \in C : f(X'_i) \ne \hat{h}_n(X'_i) \}\big) \le \varepsilon/2.
$$

The expected value of the smallest value of $n$ for which this occurs is $o(1/\varepsilon)$. Note that this
only requires access to the prior $\pi$, not the data distribution $\mathcal{D}$ (the budget-based algorithm $A_b$
of [Hanneke, 2009] has no direct dependence on $\mathcal{D}$); if desired for computational efficiency, this
quantity may also be estimated by a $1 - \varepsilon/4$ confidence upper bound based on $\Omega\big(\frac{1}{\varepsilon^2} \log \frac{1}{\varepsilon}\big)$
independent samples of $h^*$ values with distribution $\pi$, where for each sample we simulate the
execution of $A_b(n)$ for that (simulated) target function in order to obtain the returned classifier.
In particular, note that no actual label requests to the oracle are required during this process of
estimating the appropriate label budget $n_{\pi,\varepsilon}$, as all executions of $A_b$ are simulated.
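
The simulation idea can be sketched in Python for a toy problem. Everything concrete here is an assumption made for illustration: the concept class is thresholds on $[0, 1]$, the prior is uniform over the threshold, `budget_learner` is a binary-search stand-in for $A_b$ (not the algorithm of [Hanneke, 2009]), and the plain Monte Carlo average omits the Hoeffding confidence margin described above.

```python
import random


def budget_learner(label, n):
    """Toy stand-in for A_b(n): n binary-search label requests on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if label(mid) > 0:      # threshold is at or to the left of mid
            hi = mid
        else:                   # threshold is to the right of mid
            lo = mid
    return (lo + hi) / 2        # estimated threshold


def simulated_error(n, holdout, num_targets, rng):
    """Average holdout disagreement over simulated targets from the prior."""
    total = 0.0
    for _ in range(num_targets):
        t = rng.random()        # assumed prior: threshold uniform on [0, 1]
        t_hat = budget_learner(lambda x: 1 if x >= t else -1, n)
        total += sum((x >= t) != (x >= t_hat) for x in holdout) / len(holdout)
    return total / num_targets


def choose_budget(eps, holdout, num_targets, rng, n_max=60):
    """Increase n from 1 until the simulated error estimate is <= eps / 2."""
    for n in range(1, n_max + 1):
        if simulated_error(n, holdout, num_targets, rng) <= eps / 2:
            return n
    raise ValueError("no budget up to n_max met the target")


rng = random.Random(0)
holdout = [rng.random() for _ in range(1000)]   # unlabeled X'_1, ..., X'_M
n_hat = choose_budget(0.05, holdout, num_targets=300, rng=rng)
print("selected label budget:", n_hat)
```

Only simulated labels are used: the oracle for the real target is never queried while the budget is being selected.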
6.7  Inherent Dependence on $\pi$ in the Sample Complexity


We have shown that for every prior $\pi$, the sample complexity is bounded by an $o(1/\varepsilon)$ function.
One might wonder whether it is possible that the asymptotic dependence on $\varepsilon$ in the sample
complexity can be prior-independent, while still being $o(1/\varepsilon)$. That is, we can ask whether
there exists a ($\pi$-independent) function $s(\varepsilon) = o(1/\varepsilon)$ such that, for every $\pi$, there is a correct
$\pi$-dependent algorithm $A$ achieving a sample complexity $SC(A, \varepsilon, \mathcal{D}, \pi) = O(s(\varepsilon))$, possibly
involving $\pi$-dependent constants. Certainly in some cases, such as threshold classifiers, this is
true. However, it seems this is not generally the case, and in particular it fails to hold for the
space of interval classifiers.

For instance, consider a prior $\pi$ on the space $C$ of interval classifiers, constructed as follows.
We are given an arbitrary monotonic $g(\varepsilon) = o(1/\varepsilon)$; since $g(\varepsilon) = o(1/\varepsilon)$, there must exist
(nonzero) functions $q_1(i)$ and $q_2(i)$ such that $\lim_{i \to \infty} q_1(i) = 0$, $\lim_{i \to \infty} q_2(i) = 0$, and $\forall i \in \mathbb{N}$,
$g(q_1(i)/2^{i+1}) \le q_2(i) \cdot 2^i$; furthermore, letting $q(i) = \max\{q_1(i), q_2(i)\}$, by monotonicity of
$g$ we also have $\forall i \in \mathbb{N}$, $g(q(i)/2^{i+1}) \le q(i) \cdot 2^i$, and $\lim_{i \to \infty} q(i) = 0$. Then define a function
$p(i)$ with $\sum_{i \in \mathbb{N}} p(i) = 1$ such that $p(i) \ge q(i)$ for infinitely many $i \in \mathbb{N}$; for instance, this can
be done inductively as follows. Let $\alpha_0 = 1/2$; for each $i \in \mathbb{N}$, if $q(i) > \alpha_{i-1}$, set $p(i) = 0$
and $\alpha_i = \alpha_{i-1}$; otherwise, set $p(i) = \alpha_{i-1}$ and $\alpha_i = \alpha_{i-1}/2$. Finally, for each $i \in \mathbb{N}$, and each
$j \in \{0, 1, \ldots, 2^i - 1\}$, define $\pi\big(\{\mathbb{1}_{[j \cdot 2^{-i},\, (j+1) \cdot 2^{-i}]}\}\big) = p(i)/2^i$.
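
The inductive definition of $p(i)$ translates directly into a short loop. The Python sketch below follows the rule as stated; the choice $q(i) = 1/i$ and the truncation to 16 levels are assumptions made purely for illustration.

```python
def build_p(q, num_levels):
    """Return [p(1), ..., p(num_levels)] from the inductive rule."""
    alpha = 0.5                      # alpha_0 = 1/2
    p = []
    for i in range(1, num_levels + 1):
        if q(i) > alpha:             # q(i) still too large: no mass at level i
            p.append(0.0)
        else:                        # assign alpha_{i-1}, then halve alpha
            p.append(alpha)
            alpha /= 2
    return p


def interval_masses(p):
    """Prior mass of each single dyadic interval of width 2^-i, per level i."""
    return [p_i / 2 ** i for i, p_i in enumerate(p, start=1)]


p = build_p(lambda i: 1.0 / i, 16)   # q(i) = 1/i is an illustrative choice
masses = interval_masses(p)
print([(i, p_i) for i, p_i in enumerate(p, start=1) if p_i > 0])
```

The assigned values are $1/2, 1/4, 1/8, \ldots$ in order, so the total mass tends to 1, and every level with $p(i) > 0$ satisfies $p(i) \ge q(i)$ by construction.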

We let $\mathcal{D}$ be uniform on $\mathcal{X} = [0, 1]$. Then for each $i \in \mathbb{N}$ s.t. $p(i) \ge q(i)$, there is a
$p(i)$ probability the target interval has width $2^{-i}$, and given this any algorithm requires $\propto 2^i$
expected number of requests to determine which of these $2^i$ intervals is the target, failing which
the error rate is at least $2^{-i}$. In particular, letting $\varepsilon_i = p(i)/2^{i+1}$, any correct algorithm has sample
complexity at least $\propto p(i) \cdot 2^i$ for $\varepsilon = \varepsilon_i$. Noting $p(i) \cdot 2^i \ge q(i) \cdot 2^i \ge g(q(i)/2^{i+1}) \ge g(\varepsilon_i)$, this
implies there exist arbitrarily small values of $\varepsilon > 0$ for which the optimal sample complexity is
at least $\propto g(\varepsilon)$, so that the sample complexity is not $o(g(\varepsilon))$.

For any $s(\varepsilon) = o(1/\varepsilon)$, there exists a monotonic $g(\varepsilon) = o(1/\varepsilon)$ such that $s(\varepsilon) = o(g(\varepsilon))$.
Thus, constructing $\pi$ as above for this $g$, we have that the sample complexity is not $o(g(\varepsilon))$,
and therefore not $O(s(\varepsilon))$. So at least for the space of interval classifiers, the specific $o(1/\varepsilon)$
asymptotic dependence on $\varepsilon$ is inherently $\pi$-dependent. This argument also illustrates that the
$o(1/\varepsilon)$ result in Theorem 6.2 is essentially the strongest possible at this level of generality (i.e.,
without saying more about $C$, $\mathcal{D}$, or $\pi$).
Chapter  7


Prior  Estimation  for  Transfer  Learning 


Abstract

¹We explore a transfer learning setting, in which a finite sequence of target concepts is sampled
independently with an unknown distribution from a known family. We study the total number of
labeled examples required to learn all targets to an arbitrary specified expected accuracy, focusing
on the asymptotics in the number of tasks and the desired accuracy. Our primary interest is
formally understanding the fundamental benefits of transfer learning, compared to learning each
target independently from the others. Our approach to the transfer problem is general, in the
sense that it can be used with a variety of learning protocols.

¹Joint work with Jaime Carbonell and Steve Hanneke.


7.1  Introduction

Transfer learning reuses knowledge from past related tasks to ease the process of learning to
perform a new task. The goal of transfer learning is to leverage previous learning and experience
to more efficiently learn novel, but related, concepts, compared to what would be possible without
this prior experience. The utility of transfer learning is typically measured by a reduction in
the number of training examples required to achieve a target performance on a sequence of related
learning problems, compared to the number required for unrelated problems: i.e., reduced
sample complexity. In many real-life scenarios, just a few training examples of a new concept
or process are often sufficient for a human learner to grasp the new concept given knowledge of
related ones. For example, learning to drive a van becomes a much easier task if we have already
learned how to drive a car. Learning French is somewhat easier if we have already learned English
(vs Chinese), and learning Spanish is easier if we know Portuguese (vs German). We are
therefore interested in understanding the conditions that enable a learning machine to leverage
abstract knowledge obtained as a by-product of learning past concepts, to improve its performance
on future learning problems. Furthermore, we are interested in how the magnitude of
these improvements grows as the learning system gains more experience from learning multiple
related concepts.

The  ability  to  transfer  knowledge  gained  from  previous  tasks  to  make  it  easier  to  learn  a  new 
task  can  potentially  benefit  a  wide  range  of  real-world  applications,  including  computer  vision, 
natural  language  processing,  cognitive  science  (e.g.,  fMRI  brain  state  classification),  and  speech 
recognition,  to  name  a  few.  As  an  example,  consider  training  a  speech  recognizer.  After  training 
on  a  number  of  individuals,  a  learning  system  can  identify  common  patterns  of  speech,  such  as 
accents  or  dialects,  each  of  which  requires  a  slightly  different  speech  recognizer;  then,  given  a 
new  person  to  train  a  recognizer  for,  it  can  quickly  determine  the  particular  dialect  from  only  a 
few  well-chosen  examples,  and  use  the  previously-learned  recognizer  for  that  particular  dialect. 
In  this  case,  we  can  think  of  the  transferred  knowledge  as  consisting  of  the  common  aspects  of 
each  recognizer  variant  and  more  generally  the  distribution  of  speech  patterns  existing  in  the 
population  these  subjects  are  from.  This  same  type  of  distribution-related  knowledge  transfer 
can  be  helpful  in  a  host  of  applications,  including  all  those  mentioned  above. 

Supposing these target concepts (e.g., speech patterns) are sampled independently from a
fixed population, having knowledge of the distribution of concepts in the population may often
be quite valuable. More generally, we may consider a general scenario in which the target concepts
are sampled i.i.d. according to a fixed distribution. As we show below, the number of
labeled examples required to learn a target concept sampled according to this distribution may
be dramatically reduced if we have direct knowledge of the distribution. However, since in many
real-world learning scenarios, we do not have direct access to this distribution, it is desirable to be
able to somehow learn the distribution, based on observations from a sequence of learning problems
with target concepts sampled according to that distribution. The hope is that an estimate
of the distribution so-obtained might be almost as useful as direct access to the true distribution
in reducing the number of labeled examples required to learn subsequent target concepts. The
focus of this paper is an approach to transfer learning based on estimating the distribution of
the target concepts. Whereas we acknowledge that there are other important challenges in transfer
learning, such as exploring improvements obtainable from transfer under various alternative
notions of task relatedness [Ben-David and Schuller, 2003, Evgeniou and Pontil, 2004], or alternative
reuses of knowledge obtained from previous tasks [Thrun, 1996], we believe that learning
the distribution of target concepts is a central and crucial component in many transfer learning
scenarios, and can reduce the total sample complexity across tasks.

Note that it is not immediately obvious that the distribution of targets can even be learned
in this context, since we do not have direct access to the target concepts sampled according to
it, but rather have only indirect access via a finite number of labeled examples for each task; a
significant part of the present work focuses on establishing that as long as these finite labeled
samples are larger than a certain size, they hold sufficient information about the distribution over
concepts for estimation to be possible. In particular, in contrast to standard results on consistent
density estimation, our estimators are not directly based on the target concepts, but rather are
only indirectly dependent on these via the labels of a finite number of data points from each
task. One desideratum we pay particular attention to is minimizing the number of extra labeled
examples needed for each task, beyond what is needed for learning that particular target, so that
the benefits of transfer learning are obtained almost as a by-product of learning the targets. Our
technique is general, in that it applies to any concept space with finite VC dimension; also, the
process of learning the target concepts is (in some sense) decoupled from the mechanism of
learning the concept distribution, so that we may apply our technique to a variety of learning
protocols, including passive supervised learning, active supervised learning, semi-supervised
learning, and learning with certain general data-dependent forms of interaction [Hanneke, 2009].
For simplicity, we choose to formulate our transfer learning algorithms in the language of active
learning; as we show, this problem can benefit significantly from transfer. Formulations for other
learning protocols would follow along similar lines, with analogous theorems; these results are
particularly interesting when composed with the results on prior-dependent active learning from
the previous chapter.

Transfer learning is related at least in spirit to much earlier work on case-based and analogical
learning [Carbonell, 1983, 1986, Kolodner (Ed), 1993, Thrun, 1996, Veloso and Carbonell,
1993], although that body of work predated modern machine learning, and focused on symbolic
reuse of past problem-solving solutions rather than on current machine learning problems such as
classification, regression or structured learning. More recently, transfer learning (and the closely
related problem of multitask learning) has been studied in specific cases with interesting (though
sometimes heuristic) approaches [Baxter, 1997, Ben-David and Schuller, 2003, Caruana, 1997,
Micchelli and Pontil, 2004, Silver, 2000]. This paper considers a general theoretical framework
for transfer learning, based on an Empirical Bayes perspective, and derives rigorous theoretical
results on the benefits of transfer. We discuss the relation of this analysis to existing theoretical
work on transfer learning below.

7.1.1  Outline  of  the  paper 

The remainder of the paper is organized as follows. In Section 7.2 we introduce basic notation
used throughout, and survey some related work from the existing literature. In Section 7.3, we
describe and analyze our proposed method for estimating the distribution of target concepts, the
key ingredient in our approach to transfer learning, which we then present in Section 7.4.

7.2  Definitions  and  Related  Work 

First, we state a few basic notational conventions. We denote $\mathbb{N} = \{1, 2, \ldots\}$ and $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$.
For any random variable $X$, we generally denote by $\mathbb{P}_X$ the distribution of $X$ (the induced
probability measure on the range of $X$), and by $\mathbb{P}_{X|Y}$ the regular conditional distribution of $X$
given $Y$. For any pair of probability measures $\mu_1, \mu_2$ on a measurable space $(\Omega, \mathcal{F})$, we define

$$
\|\mu_1 - \mu_2\| = \sup_{A \in \mathcal{F}} |\mu_1(A) - \mu_2(A)|.
$$
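
For distributions on a finite set, this supremum can be computed directly. The Python sketch below is an illustration restricted to finitely supported measures represented as dictionaries (an assumption; the text allows general measurable spaces): `tv_distance` uses the fact that the supremum is attained at the event where $\mu_1$ exceeds $\mu_2$, and `tv_brute_force` checks this against an explicit search over all events.

```python
from itertools import combinations


def tv_distance(mu1, mu2):
    """sup_A |mu1(A) - mu2(A)| for finitely supported distributions (dicts)."""
    support = set(mu1) | set(mu2)
    # The supremum is attained at A = {a : mu1(a) > mu2(a)}.
    return sum(max(mu1.get(a, 0.0) - mu2.get(a, 0.0), 0.0) for a in support)


def tv_brute_force(mu1, mu2):
    """Same quantity by enumerating every event; exponential in support size."""
    support = sorted(set(mu1) | set(mu2))
    best = 0.0
    for r in range(len(support) + 1):
        for event in combinations(support, r):
            m1 = sum(mu1.get(a, 0.0) for a in event)
            m2 = sum(mu2.get(a, 0.0) for a in event)
            best = max(best, abs(m1 - m2))
    return best


mu1 = {"a": 0.5, "b": 0.5}
mu2 = {"a": 0.25, "b": 0.25, "c": 0.5}
print(tv_distance(mu1, mu2), tv_brute_force(mu1, mu2))
```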

Next we define the particular objects of interest to our present discussion. Let $\Theta$ be an
arbitrary set (called the parameter space), $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a Borel space [Schervish, 1995] (where
$\mathcal{X}$ is called the instance space), and $\mathcal{D}$ be a fixed distribution on $\mathcal{X}$ (called the data distribution).
For instance, $\Theta$ could be $\mathbb{R}^n$ and $\mathcal{X}$ could be $\mathbb{R}^m$, for some $n, m \in \mathbb{N}$, though more general
scenarios are certainly possible as well, including infinite-dimensional parameter spaces. Let $C$
be a set of measurable classifiers $h : \mathcal{X} \to \{-1, +1\}$ (called the concept space), and suppose
$C$ has VC dimension $d < \infty$ [Vapnik, 1982] (such a space is called a VC class). $C$ is equipped
with its Borel $\sigma$-algebra $\mathcal{B}$, induced by the pseudo-metric $\rho(h, g) = \mathcal{D}(\{x \in \mathcal{X} : h(x) \ne g(x)\})$.
Though all of our results can be formulated for general $\mathcal{D}$ in slightly more complex terms, for
simplicity throughout the discussion below we suppose $\rho$ is actually a metric, in that any $h, g \in C$
with $h \ne g$ have $\rho(h, g) > 0$; this amounts to a topological assumption on $C$ relative to $\mathcal{D}$.
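
The pseudo-metric $\rho$ is simply the probability, under the data distribution, that two classifiers disagree, so it can be estimated from unlabeled samples. The Python sketch below is a Monte Carlo illustration under assumed specifics: threshold classifiers on $[0, 1]$ and a uniform data distribution, neither of which is prescribed by the text.

```python
import random


def estimate_rho(h, g, sample_x, num_samples, rng):
    """Monte Carlo estimate of rho(h, g) = D({x : h(x) != g(x)})."""
    disagreements = sum(
        h(x) != g(x) for x in (sample_x(rng) for _ in range(num_samples))
    )
    return disagreements / num_samples


def threshold(t):
    """Classifier labeling x as +1 when x >= t and -1 otherwise."""
    return lambda x: 1 if x >= t else -1


rng = random.Random(0)
uniform = lambda r: r.random()          # assumed data distribution on [0, 1]
rho_hat = estimate_rho(threshold(0.3), threshold(0.5), uniform, 20000, rng)
print(rho_hat)   # the exact value under the uniform distribution is 0.2
```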

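For each $\theta \in \Theta$, $\pi_\theta$ is a distribution on $C$ (called a prior). Our only (rather mild) assumption
on this family of prior distributions is that $\{\pi_\theta : \theta \in \Theta\}$ be totally bounded, in the sense that
$\forall \varepsilon > 0$, $\exists$ finite $\Theta_\varepsilon \subseteq \Theta$ s.t. $\forall \theta \in \Theta$, $\exists \theta_\varepsilon \in \Theta_\varepsilon$ with $\|\pi_\theta - \pi_{\theta_\varepsilon}\| < \varepsilon$. See [Devroye and Lugosi,
2001] for examples of categories of classes that satisfy this.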
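
A toy family makes the total boundedness condition concrete. The Python sketch below assumes a hypothetical two-concept space $\{h_1, h_2\}$ with $\pi_\theta(\{h_1\}) = \theta$ for $\theta \in [0, 1]$, so that $\|\pi_\theta - \pi_{\theta'}\| = |\theta - \theta'|$; a uniform grid of parameters then serves as the finite set $\Theta_\varepsilon$.

```python
import math


def tv_two_point(theta_a, theta_b):
    """Distance between pi_a and pi_b for the two-concept toy family."""
    return abs(theta_a - theta_b)


def finite_cover(eps):
    """Grid of centers such that every theta in [0, 1] is within eps of one."""
    k = math.ceil(1.0 / eps)
    # Midpoints of k equal cells: each theta is within 1/(2k) <= eps/2.
    return [(2 * j + 1) / (2 * k) for j in range(k)]


def nearest_center(theta, cover):
    """Element of the cover closest to theta in the distance above."""
    return min(cover, key=lambda c: tv_two_point(theta, c))


cover = finite_cover(0.1)
worst = max(
    tv_two_point(theta, nearest_center(theta, cover))
    for theta in (i / 1000 for i in range(1001))
)
print(len(cover), worst)
```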

The general setup for the learning problem is that we have a true parameter value $\theta_\star \in \Theta$, and
a collection of $C$-valued random variables $\{h^*_{t\theta}\}_{t \in \mathbb{N}, \theta \in \Theta}$, where for a fixed $\theta \in \Theta$ the $\{h^*_{t\theta}\}_{t \in \mathbb{N}}$
variables are i.i.d. with distribution $\pi_\theta$.

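The learning problem is the following. For each $\theta \in \Theta$, there is a sequence

$$
Z_t(\theta) = \{(X_{t1}, Y_{t1}(\theta)), (X_{t2}, Y_{t2}(\theta)), \ldots\},
$$

where $\{X_{ti}\}_{t,i \in \mathbb{N}}$ are i.i.d. $\mathcal{D}$, and for each $t, i \in \mathbb{N}$, $Y_{ti}(\theta) = h^*_{t\theta}(X_{ti})$. For $k \in \mathbb{N}$ we denote by
$Z_{tk}(\theta) = \{(X_{t1}, Y_{t1}(\theta)), \ldots, (X_{tk}, Y_{tk}(\theta))\}$. Since the $Y_{ti}(\theta)$ are the actual $h^*_{t\theta}(X_{ti})$ values, we
are studying the non-noisy, or realizable-case, setting.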
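
The data model can be simulated in a few lines. The Python sketch below generates $Z_{tk}(\theta)$ for several tasks under assumed specifics that are not part of the text: threshold classifiers on $[0, 1]$, a uniform data distribution, and a hypothetical prior family in which $\pi_\theta$ draws the threshold from a Beta distribution with parameters $\theta = (a, b)$.

```python
import random


def sample_task(theta, k, rng):
    """Draw one task: a target from pi_theta and its first k labeled points."""
    a, b = theta                                  # hypothetical: pi_theta = Beta(a, b)
    t_star = rng.betavariate(a, b)                # threshold of the target concept
    xs = [rng.random() for _ in range(k)]         # X_{t1..tk}, i.i.d. uniform
    ys = [1 if x >= t_star else -1 for x in xs]   # Y_{ti} = h*(X_{ti}), noise-free
    return t_star, list(zip(xs, ys))


def sample_tasks(theta, num_tasks, k, rng):
    """Targets are i.i.d. across tasks, all drawn with the same theta."""
    return [sample_task(theta, k, rng) for _ in range(num_tasks)]


rng = random.Random(0)
tasks = sample_tasks(theta=(2.0, 5.0), num_tasks=3, k=4, rng=rng)
for t_star, z in tasks:
    print(round(t_star, 3), [(round(x, 3), y) for x, y in z])
```

A learner in this setting sees only the pairs in each task, never `t_star` itself, which is why estimating $\theta$ from the labeled samples is nontrivial.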

The algorithm receives values $\varepsilon$ and $T$ as input, and for each $t \in \{1, 2, \ldots, T\}$ in increasing
order, it observes the sequence $X_{t1}, X_{t2}, \ldots$, and may then select an index $i_1$, receive label
$Y_{t i_1}(\theta_\star)$, select another index $i_2$, receive label $Y_{t i_2}(\theta_\star)$, etc. The algorithm proceeds in this
fashion, sequentially requesting labels, until eventually it produces a classifier $\hat{h}_t$. It then increments
$t$ and repeats this process until it produces a sequence $\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_T$, at which time it halts. To be
called correct, the algorithm must have a guarantee that $\forall \theta_\star \in \Theta$, $\forall t \le T$, $\mathbb{E}\big[\rho(\hat{h}_t, h^*_{t\theta_\star})\big] \le \varepsilon$,
for any values of $T \in \mathbb{N}$ and $\varepsilon > 0$ given as input. We will be interested in the expected number
of label requests necessary for a correct learning algorithm, averaged over the $T$ tasks, and in
particular in how shared information between tasks can help to reduce this quantity when direct
access to $\theta_\star$ is not available to the algorithm.

7.2.1  Relation  to  Existing  Theoretical  Work  on  Transfer  Learning 

Although we know of no existing work on the theoretical advantages of transfer learning for
active learning, the existing literature contains several analyses of the advantages of transfer
learning for passive learning. In his classic work, Baxter ([Baxter, 1997] section 4) explores a
similar setup for a general form of passive learning, except in a full Bayesian setting (in contrast
to our setting, often referred to as "empirical Bayes," which includes a constant parameter $\theta_\star$ to be
estimated from data). Essentially, [Baxter, 1997] sets up a hierarchical Bayesian model, in which
(in our notation) $\theta_\star$ is a random variable with known distribution (hyper-prior), but otherwise the
specialization of Baxter's setting to the pattern recognition problem is essentially identical to our
setup above. This hyper-prior does make the problem slightly easier, but generally the results
of [Baxter, 1997] are of a different nature than our objectives here. Specifically, Baxter's results
on learning from labeled examples can be interpreted as indicating that transfer learning can
improve certain constant factors in the asymptotic rate of convergence of the average of expected
error rates across the learning problems. That is, certain constant complexity terms (for instance,
related to the concept space) can be reduced to (potentially much smaller) values related to $\pi_{\theta_\star}$ by
transfer learning. Baxter argues that, as the number of tasks grows large, this effectively achieves
close to the known results on the sample complexity of passive learning with direct access to $\theta_\star$.
A similar claim is discussed by Ando and Zhang [Ando and Zhang, 2004] (though in less detail)
for a setting closer to that studied here, where $\theta_\star$ is an unknown parameter to be estimated.

There  are  also  several  results  on  transfer  learning  of  a  slightly  different  variety,  in  which, 
rather  than  having  a  prior  distribution  for  the  target  concept,  the  learner  initially  has  several 
potential  concept  spaces  to  choose  from,  and  the  role  of  transfer  is  to  help  the  learner  select  from 
among  these  concept  spaces  [Ando  and  Zhang,  2005,  Baxter,  2000].  In  this  case,  the  idea  is 
that  one  of  these  concept  spaces  has  the  best  average  minimum  achievable  error  rate  per  learning 
problem,  and  the  objective  of  transfer  learning  is  to  perform  nearly  as  well  as  if  we  knew  which 
of  the  spaces  has  this  property.  In  particular,  if  we  assume  the  target  functions  for  each  task  all 
reside  in  one  of  the  concept  spaces,  then  the  objective  of  transfer  learning  is  to  perform  nearly 
as  well  as  if  we  knew  which  of  the  spaces  contains  the  targets.  Thus,  transfer  learning  results 
in  a  sample  complexity  related  to  the  number  of  learning  problems,  a  complexity  term  for  this 
best  concept  space,  and  a  complexity  term  related  to  the  diversity  of  concept  spaces  we  have  to 
choose  from.  In  particular,  as  with  [Baxter,  1997],  these  results  can  typically  be  interpreted  as 
giving  constant  factor  improvements  from  transfer  in  a  passive  learning  context,  at  best  reducing 
the  complexity  constants,  from  those  for  the  union  over  the  given  concept  spaces,  down  to  the 
complexity  constants  of  the  single  best  concept  space. 
In addition to the above works, there are several analyses of transfer learning and multitask
learning of an entirely different nature than our present discussion, in that the objectives of the
analysis are somewhat different. Specifically, there is a branch of the literature concerned with
task relatedness, not in terms of the underlying process that generates the target concepts, but
rather directly in terms of relations between the target concepts themselves. In this sense, several
tasks with related target concepts should be much easier to learn than tasks with unrelated target
concepts. This is studied in the context of kernel methods by [Evgeniou and Pontil, 2004,
Evgeniou, Micchelli, and Pontil, 2005, Micchelli and Pontil, 2004], and in a more general theoretical
framework by [Ben-David and Schuller, 2003]. As mentioned, our approach to transfer learning
is based on the idea of estimating the distribution of target concepts. As such, though interesting
and important, these notions of direct relatedness of target concepts are not as relevant to our
present discussion.

As with [Baxter, 1997], the present work is interested in showing that as the number of
tasks grows large, we can effectively achieve a sample complexity close to that achievable with
direct access to $\theta_\star$. However, in contrast, we are interested in a general approach to transfer
learning and the analysis thereof, leading to concrete results for a variety of learning protocols
such as active learning and semi-supervised learning. In particular, our analysis of active learning
reveals the interesting phenomenon that transfer learning can sometimes improve the asymptotic
dependence on $\varepsilon$, rather than merely the constant factors as in the analysis of [Baxter, 1997].

Our  work  contrasts  with  [Baxter,  1997]  in  another  important  respect,  which  significantly 
changes  the  way  we  approach  the  problem.  Specifically,  in  Baxter’s  analysis,  the  results  (e.g., 
[Baxter,  1997]  Theorems  4,  6)  regard  the  average  loss  over  the  tasks,  and  are  stated  as  a  function 
of  the  number  of  samples  per  task.  This  number  of  samples  plays  a  dual  role  in  Baxter’s  analysis, 
since  these  samples  are  used  both  by  the  individual  learning  algorithm  for  each  task,  and  also  for 
the global transfer learning process that provides the learners with information about $\theta_\star$. Baxter
is  then  naturally  interested  in  the  rates  at  which  these  losses  shrink  as  the  sample  sizes  grow 
large, and therefore formulates the results in terms of the asymptotic behavior as the per-task
sample  sizes  grow  large.  In  particular,  the  results  of  [Baxter,  1997]  involve  residual  terms  which 
become  negligible  for  large  sample  sizes,  but  may  be  more  significant  for  smaller  sample  sizes. 

In our work, we are interested in decoupling these two roles for the sample sizes; in particular,
our results regard only the number of tasks as an asymptotic variable, while the number of
samples  per  task  remains  bounded.  First,  we  note  a  very  practical  motivation  for  this:  namely, 
non-altruistic  learners.  In  many  settings  where  transfer  learning  may  be  useful,  it  is  desirable 
that  the  number  of  labeled  examples  we  need  to  collect  from  each  particular  learning  problem 
never  be  significantly  larger  than  the  number  of  such  examples  required  to  solve  that  particular 
problem  (i.e.,  to  learn  that  target  concept  to  the  desired  accuracy).  For  instance,  this  is  the  case 
when  the  learning  problems  are  not  all  solved  by  the  same  individual  (or  company,  etc.),  but 
rather  a  coalition  of  cooperating  individuals  (e.g.,  hospitals  sharing  data  on  clinical  trials);  each 
individual  may  be  willing  to  share  the  data  they  used  to  learn  their  particular  concept,  in  the 
interest  of  making  others’  learning  problems  easier;  however,  they  may  not  be  willing  to  collect 
significantly  more  data  than  they  themselves  need  for  their  own  learning  problem.  We  should 
therefore be particularly interested in studying transfer as a by-product of the usual learning process;
failing this, we are interested in the minimum possible number of extra labeled examples
per  task  to  gain  the  benefits  of  transfer  learning. 

The issue of non-altruistic learners also presents a further technical problem in that the
individuals solving each task may be unwilling to alter their method of gathering data to be more
informative for the transfer learning process. That is, we expect the learning process for each
task is designed with the sole intention of estimating the target concept, without regard for the
global transfer learning problem. To account for this, we model the transfer learning problem in
a reduction-style framework, in which we suppose there is some black-box learning algorithm to
be run for each task, which takes a prior as input and has a theoretical guarantee of good
performance provided the prior is correct. We place almost no restrictions whatsoever on this learning
algorithm, including the manner in which it accesses the data. This allows remarkable generality,
since this procedure could be passive, active, semi-supervised, or some other kind of query-based
strategy. However, because of this generality, we have no guarantee on the information about $\theta_\star$
reflected in the data used by this algorithm (especially if it is an active learning algorithm). As
such, we choose not to use the label information gathered by the learning algorithm for each
task when estimating $\theta_\star$, but instead take a small number of additional random labeled
examples from each task with which to estimate $\theta_\star$. Again, we want to minimize this number of
additional samples per task; indeed, in this work we are able to make do with a mere constant
number of additional samples per task. To our knowledge, no result of this type (estimating $\theta_\star$
using a bounded sample size per learning problem) has previously been established at the level
of generality studied here.

7.3  Estimating  the  Prior 

The advantage of transfer learning in this setting is that each learning problem provides some
information about $\theta_\star$, so that after solving several of the learning problems, we might hope to be
able to estimate $\theta_\star$. Then, with this estimate in hand, we can use the corresponding estimated
prior distribution in the learning algorithm for subsequent learning problems, to help inform
the learning process similarly to how direct knowledge of $\theta_\star$ might be helpful. However, the
difficulty in approaching this is how to define such an estimator. Since we do not have direct
access to the $h^*_{t\theta_\star}$ values, but rather only indirect observations via a finite number of example
labels, the standard results for density estimation from i.i.d. samples cannot be applied.

The idea we pursue below is to consider the distributions on $Z_{tk}(\theta_\star)$. These variables are
directly observable, by requesting the labels of those examples. Thus, for any finite $k \in \mathbb{N}$, this
distribution is estimable from observable data. That is, using the i.i.d. values $Z_{1k}(\theta_\star), \ldots, Z_{tk}(\theta_\star)$,
we can apply standard techniques for density estimation to arrive at an estimator of $\mathbb{P}_{Z_{tk}(\theta_\star)}$. Then
the question is whether the distribution $\mathbb{P}_{Z_{tk}(\theta_\star)}$ uniquely characterizes the prior distribution
$\pi_{\theta_\star}$: that is, whether $\pi_{\theta_\star}$ is identifiable from $\mathbb{P}_{Z_{tk}(\theta_\star)}$.

As an example, consider the space of half-open interval classifiers on $[0,1]$:
$\mathbb{C} = \{\mathbb{1}^{\pm}_{[a,b)} : 0 \le a \le b \le 1\}$, where $\mathbb{1}^{\pm}_{[a,b)}(x) = +1$ if $a \le x < b$ and $-1$ otherwise. In this case,
$\pi_{\theta_\star}$ is not necessarily identifiable from $\mathbb{P}_{Z_{t1}(\theta_\star)}$; for instance, the distributions $\pi_{\theta_1}$ and $\pi_{\theta_2}$
characterized by $\pi_{\theta_1}(\{\mathbb{1}^{\pm}_{[0,1)}\}) = \pi_{\theta_1}(\{\mathbb{1}^{\pm}_{[0,0)}\}) = 1/2$ and
$\pi_{\theta_2}(\{\mathbb{1}^{\pm}_{[0,1/2)}\}) = \pi_{\theta_2}(\{\mathbb{1}^{\pm}_{[1/2,1)}\}) = 1/2$ are not distinguished by these one-dimensional
distributions. However, it turns out that for this half-open intervals problem, $\pi_{\theta_\star}$ is uniquely
identifiable from $\mathbb{P}_{Z_{t2}(\theta_\star)}$; for instance, in the $\theta_1$ vs $\theta_2$ scenario, the conditional probability
$\mathbb{P}_{(Y_{t1}(\theta_i),Y_{t2}(\theta_i))|(X_{t1},X_{t2})}((+1,+1)\,|\,(1/4,3/4))$ will distinguish $\pi_{\theta_1}$ from $\pi_{\theta_2}$, and this can
be calculated from $\mathbb{P}_{Z_{t2}(\theta_i)}$. The crucial element of the analysis below is determining the
appropriate value of $k$ to uniquely identify $\pi_{\theta_\star}$ from $\mathbb{P}_{Z_{tk}(\theta_\star)}$ in general. As we will see,
$k = d$ (the VC dimension) is always sufficient, a key insight for the results that follow.
We will also see this is not the case for any $k < d$.
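
The interval example above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `interval` and `label_prob` and the grid of test points are illustrative assumptions. It confirms that the two priors assign the same probability to a $+1$ label at any single point of $[0,1)$, while the pair of points $(1/4, 3/4)$ separates them.

```python
from fractions import Fraction as F

def interval(a, b):
    """Half-open interval classifier: +1 on [a, b), -1 elsewhere."""
    return lambda x: 1 if a <= x < b else -1

# Each prior is a list of (probability, classifier) pairs (illustrative encoding).
prior_1 = [(F(1, 2), interval(F(0), F(1))), (F(1, 2), interval(F(0), F(0)))]
prior_2 = [(F(1, 2), interval(F(0), F(1, 2))), (F(1, 2), interval(F(1, 2), F(1)))]

def label_prob(prior, xs, ys):
    """P(h(xs) == ys) for h drawn from the prior, at fixed points xs."""
    return sum(p for p, h in prior if tuple(h(x) for x in xs) == tuple(ys))

# One point: on a grid in [0, 1), both priors give P(label = +1) = 1/2.
grid = [F(i, 20) for i in range(20)]
one_point_equal = all(
    label_prob(prior_1, [x], [+1]) == label_prob(prior_2, [x], [+1]) for x in grid
)

# Two points: the pair (1/4, 3/4) separates the two priors.
p1 = label_prob(prior_1, [F(1, 4), F(3, 4)], [+1, +1])
p2 = label_prob(prior_2, [F(1, 4), F(3, 4)], [+1, +1])
```

Here `p1` and `p2` are the two conditional probabilities of the label pair $(+1,+1)$ given the points $(1/4, 3/4)$; they differ because only the first prior contains a classifier that labels both points positive.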

To be specific, in order to transfer knowledge from one task to the next, we use a few labeled
data points from each task to gain information about $\theta_\star$. For this, for each task $t$, we simply take
the first $d$ data points in the $Z_t(\theta_\star)$ sequence. That is, we request the labels

\[ Y_{t1}(\theta_\star), Y_{t2}(\theta_\star), \ldots, Y_{td}(\theta_\star) \]

and use the points $Z_{td}(\theta_\star)$ to update an estimate of $\theta_\star$.

The following result shows that this technique does provide a consistent estimator of $\pi_{\theta_\star}$.
Again, note that this result is not a straightforward application of the standard approach to
consistent estimation, since the observations here are not the $h^*_{t\theta_\star}$ variables themselves, but rather a
number of the $Y_{ti}(\theta_\star)$ values. The key insight in this result is that $\pi_{\theta_\star}$ is uniquely identified by the
joint distribution $\mathbb{P}_{Z_{td}(\theta_\star)}$ over the first $d$ labeled examples; later, we prove this is not necessarily
true for $\mathbb{P}_{Z_{tk}(\theta_\star)}$ for values $k < d$. This identifiability result is stated below in Corollary 7.6;
as we discuss in Section 7.3.1, there is a fairly simple direct proof of this result. However,
for our purposes, we will actually require the stronger condition that any $\theta \in \Theta$ with small
$\|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta_\star)}\|$ also has small $\|\pi_\theta - \pi_{\theta_\star}\|$. This stronger requirement adds to the complexity
of the proofs. The results in this section are purely concerned with relating distances in the space
of $\mathbb{P}_{Z_{td}(\theta)}$ distributions to the corresponding distances in the space of $\pi_\theta$ distributions; as such,
they are not specific to active learning or other learning protocols, and hence are of independent
interest.

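Theorem 7.1. There exists an estimator $\hat\theta_{T\theta_\star} = \hat\theta_T(Z_{1d}(\theta_\star), \ldots, Z_{Td}(\theta_\star))$, and functions
$R : \mathbb{N}_0 \times (0,1] \to [0,\infty)$ and $\delta : \mathbb{N}_0 \times (0,1] \to [0,1]$, such that for any $\alpha > 0$,
$\lim_{T\to\infty} R(T,\alpha) = \lim_{T\to\infty} \delta(T,\alpha) = 0$ and for any $T \in \mathbb{N}_0$ and $\theta_\star \in \Theta$,

\[ \mathbb{P}\left( \|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| > R(T,\alpha) \right) \le \delta(T,\alpha) \le \alpha. \]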

One important detail to note, for our purposes, is that $R(T,\alpha)$ is independent from $\theta_\star$, so
that the value of $R(T,\alpha)$ can be calculated and used within a learning algorithm. The proof of
Theorem 7.1 will be established via the following sequence of lemmas. Lemma 7.2 relates
distances in the space of priors to distances in the space of distributions on the full data sets. In turn,
Lemma 7.3 relates these distances to distances in the space of distributions on a finite number of
examples from the data sets. Lemma 7.4 then relates the distances between distributions on any
finite number of examples to distances between distributions on $d$ examples. Finally, Lemma 7.5
presents a standard result on the existence of a converging estimator, in this case for the
distribution on $d$ examples, for totally bounded families of distributions. Tracing these relations back,
they relate convergence of the estimator for the distribution of $d$ examples to convergence of the
corresponding estimator for the prior itself.

Lemma 7.2. For any $\theta, \theta' \in \Theta$ and $t \in \mathbb{N}$,

\[ \|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\|. \]

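Proof. Fix $\theta, \theta' \in \Theta$, $t \in \mathbb{N}$. Let $\mathbb{X} = \{X_{t1}, X_{t2}, \ldots\}$, $\mathbb{Y}(\theta) = \{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$, and for
$k \in \mathbb{N}$ let $\mathbb{X}_k = \{X_{t1}, \ldots, X_{tk}\}$ and $\mathbb{Y}_k(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$. For $h \in \mathbb{C}$, let
$c_{\mathbb{X}}(h) = \{(X_{t1}, h(X_{t1})), (X_{t2}, h(X_{t2})), \ldots\}$.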

For $h, g \in \mathbb{C}$, define $\rho_{\mathbb{X}}(h,g) = \lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(X_{ti}) \ne g(X_{ti})]$ (if the limit exists), and
$\rho_{\mathbb{X}_k}(h,g) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[h(X_{ti}) \ne g(X_{ti})]$. Note that since $\mathbb{C}$ has finite VC dimension, so does
the collection of sets $\{\{x : h(x) \ne g(x)\} : h, g \in \mathbb{C}\}$, so that the uniform strong law of
large numbers implies that with probability one, $\forall h, g \in \mathbb{C}$, $\rho_{\mathbb{X}}(h,g)$ exists and has
$\rho_{\mathbb{X}}(h,g) = \rho(h,g)$ [Vapnik, 1982].

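Consider any $\theta, \theta' \in \Theta$, and any $A \in \mathcal{B}$. Then since $\mathcal{B}$ is the Borel $\sigma$-algebra induced by $\rho$,
any $h \notin A$ has $\forall g \in A$, $\rho(h,g) > 0$. Thus, if $\rho_{\mathbb{X}}(h,g) = \rho(h,g)$ for all $h, g \in \mathbb{C}$, then $\forall h \notin A$,

\[ \forall g \in A,\ \rho_{\mathbb{X}}(h,g) = \rho(h,g) > 0 \implies \forall g \in A,\ c_{\mathbb{X}}(h) \ne c_{\mathbb{X}}(g) \implies c_{\mathbb{X}}(h) \notin c_{\mathbb{X}}(A). \]

This implies $c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A)) = A$. Under these conditions,

\[ \mathbb{P}_{Z_t(\theta)|\mathbb{X}}(c_{\mathbb{X}}(A)) = \pi_\theta(c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A))) = \pi_\theta(A), \]

and similarly for $\theta'$.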

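Any measurable set $C$ for the range of $Z_t(\theta)$ can be expressed as $C = \{c_{\bar x}(h) : (h, \bar x) \in C'\}$
for some appropriate $C' \in \mathcal{B} \otimes \mathcal{B}_{\mathcal{X}}^{\infty}$. Letting $C'_{\bar x} = \{h : (h, \bar x) \in C'\}$, we have

\[ \mathbb{P}_{Z_t(\theta)}(C) = \int \pi_\theta(c_{\bar x}^{-1}(c_{\bar x}(C'_{\bar x}))) \, \mathbb{P}_{\mathbb{X}}(d\bar x) = \int \pi_\theta(C'_{\bar x}) \, \mathbb{P}_{\mathbb{X}}(d\bar x) = \mathbb{P}_{(h^*_{t\theta}, \mathbb{X})}(C'). \]

Likewise, this reasoning holds for $\theta'$. Then

\begin{align*}
\|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\| &= \|\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})} - \mathbb{P}_{(h^*_{t\theta'}, \mathbb{X})}\| \\
&= \sup_{C' \in \mathcal{B} \otimes \mathcal{B}_{\mathcal{X}}^{\infty}} \left| \int \left( \pi_\theta(C'_{\bar x}) - \pi_{\theta'}(C'_{\bar x}) \right) \mathbb{P}_{\mathbb{X}}(d\bar x) \right| \\
&\le \int \sup_{A \in \mathcal{B}} |\pi_\theta(A) - \pi_{\theta'}(A)| \, \mathbb{P}_{\mathbb{X}}(d\bar x) = \|\pi_\theta - \pi_{\theta'}\|.
\end{align*}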

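Since $h^*_{t\theta}$ and $\mathbb{X}$ are independent, for $A \in \mathcal{B}$, $\pi_\theta(A) = \mathbb{P}_{h^*_{t\theta}}(A) = \mathbb{P}_{h^*_{t\theta}}(A)\,\mathbb{P}_{\mathbb{X}}(\mathcal{X}^\infty) = \mathbb{P}_{(h^*_{t\theta}, \mathbb{X})}(A \times \mathcal{X}^\infty)$.
Analogous reasoning holds for $h^*_{t\theta'}$. Thus, we have

\begin{align*}
\|\pi_\theta - \pi_{\theta'}\| &= \|\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})}(\cdot \times \mathcal{X}^\infty) - \mathbb{P}_{(h^*_{t\theta'}, \mathbb{X})}(\cdot \times \mathcal{X}^\infty)\| \\
&\le \|\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})} - \mathbb{P}_{(h^*_{t\theta'}, \mathbb{X})}\| = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\|.
\end{align*}

Combining the above, we have $\|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\| = \|\pi_\theta - \pi_{\theta'}\|$. □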
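
Lemma 7.3. There exists a sequence $r_k = o(1)$ such that $\forall t, k \in \mathbb{N}$, $\forall \theta, \theta' \in \Theta$,

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le \|\pi_\theta - \pi_{\theta'}\| \le \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + r_k. \]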

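Proof. The left inequality follows from Lemma 7.2 and the basic definition of $\|\cdot\|$, since
$\mathbb{P}_{Z_{tk}(\theta)}(\cdot) = \mathbb{P}_{Z_t(\theta)}(\cdot \times (\mathcal{X} \times \{-1,+1\})^\infty)$, so that

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\| = \|\pi_\theta - \pi_{\theta'}\|. \]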

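The remainder of this proof focuses on the right inequality. Fix $\theta, \theta' \in \Theta$, let $\gamma > 0$, and let
$B \subseteq (\mathcal{X} \times \{-1,+1\})^\infty$ be a measurable set such that

\[ \|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\| < \mathbb{P}_{Z_t(\theta)}(B) - \mathbb{P}_{Z_t(\theta')}(B) + \gamma. \]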


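Let $\mathcal{A}$ be the collection of all measurable subsets of $(\mathcal{X} \times \{-1,+1\})^\infty$ representable in the form
$A' \times (\mathcal{X} \times \{-1,+1\})^\infty$, for some measurable $A' \subseteq (\mathcal{X} \times \{-1,+1\})^k$ and some $k \in \mathbb{N}$. In
particular, since $\mathcal{A}$ is an algebra that generates the product $\sigma$-algebra, Carathéodory's extension
theorem [Schervish, 1995] implies that there exist disjoint sets $\{A_i\}_{i \in \mathbb{N}}$ in $\mathcal{A}$ such that
$B \subseteq \bigcup_{i \in \mathbb{N}} A_i$ and

\[ \mathbb{P}_{Z_t(\theta)}(B) - \mathbb{P}_{Z_t(\theta')}(B) < \sum_{i \in \mathbb{N}} \mathbb{P}_{Z_t(\theta)}(A_i) - \sum_{i \in \mathbb{N}} \mathbb{P}_{Z_t(\theta')}(A_i) + \gamma. \]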

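Additionally, as these sums are bounded, there must exist $n \in \mathbb{N}$ such that

\[ \sum_{i \in \mathbb{N}} \mathbb{P}_{Z_t(\theta)}(A_i) < \gamma + \sum_{i=1}^{n} \mathbb{P}_{Z_t(\theta)}(A_i), \]

so that

\begin{align*}
\sum_{i \in \mathbb{N}} \mathbb{P}_{Z_t(\theta)}(A_i) - \sum_{i \in \mathbb{N}} \mathbb{P}_{Z_t(\theta')}(A_i) &< \gamma + \sum_{i=1}^{n} \mathbb{P}_{Z_t(\theta)}(A_i) - \sum_{i=1}^{n} \mathbb{P}_{Z_t(\theta')}(A_i) \\
&= \gamma + \mathbb{P}_{Z_t(\theta)}\Big( \bigcup_{i=1}^{n} A_i \Big) - \mathbb{P}_{Z_t(\theta')}\Big( \bigcup_{i=1}^{n} A_i \Big).
\end{align*}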




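As $\bigcup_{i=1}^{n} A_i \in \mathcal{A}$, there exists $k' \in \mathbb{N}$ and measurable $A' \subseteq (\mathcal{X} \times \{-1,+1\})^{k'}$ such that
$\bigcup_{i=1}^{n} A_i = A' \times (\mathcal{X} \times \{-1,+1\})^\infty$, and therefore

\begin{align*}
\mathbb{P}_{Z_t(\theta)}\Big( \bigcup_{i=1}^{n} A_i \Big) - \mathbb{P}_{Z_t(\theta')}\Big( \bigcup_{i=1}^{n} A_i \Big) &= \mathbb{P}_{Z_{tk'}(\theta)}(A') - \mathbb{P}_{Z_{tk'}(\theta')}(A') \\
&\le \|\mathbb{P}_{Z_{tk'}(\theta)} - \mathbb{P}_{Z_{tk'}(\theta')}\| \le \lim_{k \to \infty} \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|.
\end{align*}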


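In summary, we have $\|\pi_\theta - \pi_{\theta'}\| \le \lim_{k\to\infty} \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + 3\gamma$. Since this is true for an
arbitrary $\gamma > 0$, taking the limit as $\gamma \to 0$ implies

\[ \|\pi_\theta - \pi_{\theta'}\| \le \lim_{k\to\infty} \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|. \]

In particular, this implies there exists a sequence $r_k(\theta, \theta') = o(1)$ such that

\[ \forall k \in \mathbb{N},\quad \|\pi_\theta - \pi_{\theta'}\| \le \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + r_k(\theta, \theta'). \]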


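This would suffice to establish the upper bound if we were allowing $r_k$ to depend on the
particular $\theta$ and $\theta'$. However, to guarantee the same rates of convergence for all pairs of parameters
requires an additional argument. Specifically, let $\gamma > 0$ and let $\Theta_\gamma$ denote a minimal subset of $\Theta$
such that, $\forall \theta \in \Theta$, $\exists \theta_\gamma \in \Theta_\gamma$ s.t. $\|\pi_\theta - \pi_{\theta_\gamma}\| < \gamma$: that is, a minimal $\gamma$-cover. Since $|\Theta_\gamma| < \infty$
by assumption, defining $r_k(\gamma) = \max_{\theta, \theta' \in \Theta_\gamma} r_k(\theta, \theta')$, we have $r_k(\gamma) = o(1)$. Furthermore, for
any $\theta, \theta' \in \Theta$, letting $\theta_\gamma = \operatorname{argmin}_{\theta'' \in \Theta_\gamma} \|\pi_\theta - \pi_{\theta''}\|$ and $\theta'_\gamma = \operatorname{argmin}_{\theta'' \in \Theta_\gamma} \|\pi_{\theta'} - \pi_{\theta''}\|$, we have
(by triangle inequalities)

\begin{align*}
\|\pi_\theta - \pi_{\theta'}\| &\le \|\pi_\theta - \pi_{\theta_\gamma}\| + \|\pi_{\theta_\gamma} - \pi_{\theta'_\gamma}\| + \|\pi_{\theta'_\gamma} - \pi_{\theta'}\| \\
&< 2\gamma + r_k(\gamma) + \|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\|.
\end{align*}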

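By triangle inequalities and the left inequality from the lemma statement (established above), we
also have

\begin{align*}
\|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\|
&\le \|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta)}\| + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + \|\mathbb{P}_{Z_{tk}(\theta')} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\| \\
&\le \|\pi_{\theta_\gamma} - \pi_\theta\| + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + \|\pi_{\theta'} - \pi_{\theta'_\gamma}\| \\
&< 2\gamma + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|.
\end{align*}

Defining $r_k = \inf_{\gamma > 0} (4\gamma + r_k(\gamma))$, we have the right inequality of the lemma statement, and
since $r_k(\gamma) = o(1)$ for each $\gamma > 0$, we have $r_k = o(1)$. □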

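Lemma 7.4. $\forall t, k \in \mathbb{N}$, $\forall \theta, \theta' \in \Theta$,

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le 4 \cdot 2^{2k+d} k^d \sqrt{\|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|}. \]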

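Proof. Fix any $t \in \mathbb{N}$, and let $\mathbb{X} = \{X_{t1}, X_{t2}, \ldots\}$ and $\mathbb{Y}(\theta) = \{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$, and for
$k \in \mathbb{N}$ let $\mathbb{X}_k = \{X_{t1}, \ldots, X_{tk}\}$ and $\mathbb{Y}_k(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$.

If $k \le d$, then $\mathbb{P}_{Z_{tk}(\theta)}(\cdot) = \mathbb{P}_{Z_{td}(\theta)}(\cdot \times (\mathcal{X} \times \{-1,+1\})^{d-k})$, so that

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le \|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|, \]

and therefore the result trivially holds.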

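Now suppose $k > d$. For a sequence $\bar z$ and $I \subseteq \mathbb{N}$, we will use the notation $\bar z_I = \{\bar z_i : i \in I\}$.
Note that, for any $k > d$ and $\bar x^k \in \mathcal{X}^k$, there is a sequence $\bar y(\bar x^k) \in \{-1,+1\}^k$ such that no
$h \in \mathbb{C}$ has $h(\bar x^k) = \bar y(\bar x^k)$ (i.e., $\forall h \in \mathbb{C}$, $\exists i \le k$ s.t. $h(\bar x^k_i) \ne \bar y_i(\bar x^k)$). Now suppose $k > d$ and
take as an inductive hypothesis that there is a measurable set $A^* \subseteq \mathcal{X}^\infty$ of probability one with
the property that $\forall \bar x \in A^*$, for every finite $I \subset \mathbb{N}$ with $|I| > d$, for every $\bar y \in \{-1,+1\}^\infty$ with
$\|\bar y_I - \bar y(\bar x_I)\|_1 / 2 \le k - 1$,

\begin{align*}
& \left| \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar y_I | \bar x_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar y_I | \bar x_I) \right| \\
&\quad \le 2^{k-1} \cdot \max_{\tilde y^d \in \{-1,+1\}^d,\, D \in I^d} \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \bar x_D) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \bar x_D) \right|.
\end{align*}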
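
This clearly holds for $\|\bar y_I - \bar y(\bar x_I)\|_1 / 2 = 0$, since $\mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar y_I | \bar x_I) = 0$ in this case, so this
will serve as our base case in the inductive proof. Next we inductively extend this to the value
$k > 0$. Specifically, let $A^*_{k-1}$ be the $A^*$ guaranteed to exist by the inductive hypothesis, and fix
any $\bar x \in A^*$, $\bar y \in \{-1,+1\}^\infty$, and finite $I \subset \mathbb{N}$ with $|I| > d$ and $\|\bar y_I - \bar y(\bar x_I)\|_1 / 2 = k$. Let $i \in I$
be such that $\bar y_i \ne \bar y_i(\bar x_I)$, and let $\bar y' \in \{-1,+1\}^\infty$ have $\bar y'_j = \bar y_j$ for every $j \ne i$, and $\bar y'_i = -\bar y_i$.
Then

\[ \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar y_I | \bar x_I) = \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta)|\mathbb{X}_{I \setminus \{i\}}}(\bar y_{I \setminus \{i\}} | \bar x_{I \setminus \{i\}}) - \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar y'_I | \bar x_I), \qquad (7.1) \]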

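and similarly for $\theta'$. By the inductive hypothesis, this means

\begin{align*}
& \left| \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar y_I | \bar x_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar y_I | \bar x_I) \right| \\
&\quad \le \left| \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta)|\mathbb{X}_{I \setminus \{i\}}}(\bar y_{I \setminus \{i\}} | \bar x_{I \setminus \{i\}}) - \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta')|\mathbb{X}_{I \setminus \{i\}}}(\bar y_{I \setminus \{i\}} | \bar x_{I \setminus \{i\}}) \right| \\
&\qquad + \left| \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar y'_I | \bar x_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar y'_I | \bar x_I) \right| \\
&\quad \le 2^{k} \cdot \max_{\tilde y^d \in \{-1,+1\}^d,\, D \in I^d} \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \bar x_D) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \bar x_D) \right|.
\end{align*}

Therefore, by the principle of induction, this inequality holds for all $k > d$, for every $\bar x \in A^*$,
$\bar y \in \{-1,+1\}^\infty$, and finite $I \subset \mathbb{N}$, where $A^*$ has $\mathcal{D}^\infty$-probability one.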

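In particular, we have that for $\theta, \theta' \in \Theta$,

\begin{align*}
\|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|
&\le 2^{k}\, \mathbb{E}\left[ \max_{\bar y^k \in \{-1,+1\}^k} \left| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(\bar y^k | \mathbb{X}_k) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(\bar y^k | \mathbb{X}_k) \right| \right] \\
&\le 2^{2k}\, \mathbb{E}\left[ \max_{\tilde y^d \in \{-1,+1\}^d,\, D \in \{1,\ldots,k\}^d} \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_D) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_D) \right| \right] \\
&\le 2^{2k} \sum_{\tilde y^d \in \{-1,+1\}^d} \sum_{D \in \{1,\ldots,k\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_D) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_D) \right| \right].
\end{align*}

Exchangeability implies this is at most

\begin{align*}
& 2^{2k} \sum_{\tilde y^d \in \{-1,+1\}^d} \sum_{D \in \{1,\ldots,k\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \right| \right] \\
&\quad \le 2^{2k+d} k^d \max_{\tilde y^d \in \{-1,+1\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \right| \right].
\end{align*}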
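
To complete the proof, we need only bound this value by an appropriate function of
$\|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|$. Toward this end, suppose

\[ \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \right| \right] \ge \varepsilon, \]

for some $\tilde y^d$. Then either

\[ \mathbb{P}\left( \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \ge \varepsilon/4 \right) \ge \varepsilon/4, \]

or

\[ \mathbb{P}\left( \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \ge \varepsilon/4 \right) \ge \varepsilon/4. \]

For whichever is the case, let $A_\varepsilon$ denote the corresponding measurable subset of $\mathcal{X}^d$, of
probability at least $\varepsilon/4$. Then

\begin{align*}
\|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\| &\ge \left| \mathbb{P}_{Z_{td}(\theta)}(A_\varepsilon \times \{\tilde y^d\}) - \mathbb{P}_{Z_{td}(\theta')}(A_\varepsilon \times \{\tilde y^d\}) \right| \\
&\ge (\varepsilon/4)\, \mathbb{P}_{\mathbb{X}_d}(A_\varepsilon) \ge \varepsilon^2/16.
\end{align*}

Therefore,

\[ \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \right| \right] \le 4 \sqrt{\|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|}, \]

which means

\begin{align*}
& 2^{2k+d} k^d \max_{\tilde y^d \in \{-1,+1\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\tilde y^d | \mathbb{X}_d) \right| \right] \\
&\quad \le 4 \cdot 2^{2k+d} k^d \sqrt{\|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|}.
\end{align*}

□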

The following lemma is a standard result on the existence of converging density estimators
for totally bounded families of distributions. For our purposes, the details of the estimator
achieving  this  guarantee  are  not  particularly  important,  as  we  will  apply  the  result  as  stated.  For 
completeness,  we  describe  a  particular  estimator  that  does  achieve  the  guarantee  after  the  lemma. 
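
Lemma 7.5. [Devroye and Lugosi, 2001, Yatracos, 1985] Let $\mathcal{P} = \{p_\theta : \theta \in \Theta\}$ be a totally
bounded family of probability measures on a measurable space $(\Omega, \mathcal{F})$, and let $\{W_t(\theta)\}_{t \in \mathbb{N}, \theta \in \Theta}$
be $\Omega$-valued random variables such that $\{W_t(\theta)\}_{t \in \mathbb{N}}$ are i.i.d. $p_\theta$ for each $\theta \in \Theta$. Then there
exists an estimator $\hat\theta_{T\theta_\star} = \hat\theta_T(W_1(\theta_\star), \ldots, W_T(\theta_\star))$ and functions $R_{\mathcal{P}} : \mathbb{N}_0 \times (0,1] \to [0,\infty)$
and $\delta_{\mathcal{P}} : \mathbb{N}_0 \times (0,1] \to [0,1]$ such that $\forall \alpha > 0$, $\lim_{T\to\infty} R_{\mathcal{P}}(T,\alpha) = \lim_{T\to\infty} \delta_{\mathcal{P}}(T,\alpha) = 0$, and
$\forall \theta_\star \in \Theta$ and $T \in \mathbb{N}_0$,

\[ \mathbb{P}\left( \|p_{\hat\theta_{T\theta_\star}} - p_{\theta_\star}\| > R_{\mathcal{P}}(T,\alpha) \right) \le \delta_{\mathcal{P}}(T,\alpha) \le \alpha. \]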

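In many contexts (though certainly not all), even a simple maximum likelihood estimator
suffices to supply this guarantee. However, to derive results under the more general
conditions we consider here, we require a more involved method: specifically, the minimum
distance skeleton estimate explored by [Devroye and Lugosi, 2001, Yatracos, 1985], specified as
follows. Let $\Theta_\varepsilon \subseteq \Theta$ be a minimal-cardinality $\varepsilon$-cover of $\Theta$: that is, a minimal-cardinality
subset of $\Theta$ such that $\forall \theta \in \Theta$, $\exists \theta_\varepsilon \in \Theta_\varepsilon$ with $\|p_{\theta_\varepsilon} - p_\theta\| < \varepsilon$. For each $\theta, \theta' \in \Theta_\varepsilon$, let $A_{\theta,\theta'}$
be a set in $\mathcal{F}$ maximizing $p_\theta(A_{\theta,\theta'}) - p_{\theta'}(A_{\theta,\theta'})$, and let $\mathcal{A}_\varepsilon = \{A_{\theta,\theta'} : \theta, \theta' \in \Theta_\varepsilon\}$, known
as a Yatracos class. Finally, for $A \in \mathcal{F}$, let $\hat p_T(A) = T^{-1} \sum_{t=1}^{T} \mathbb{1}_A(W_t(\theta_\star))$. The minimum
distance skeleton estimate is $\hat\theta_{T\theta_\star} = \operatorname{argmin}_{\theta \in \Theta_\varepsilon} \sup_{A \in \mathcal{A}_\varepsilon} |p_\theta(A) - \hat p_T(A)|$. The reader
is referred to [Devroye and Lugosi, 2001, Yatracos, 1985] for a proof that this method
satisfies the guarantee of Lemma 7.5. In particular, if $\varepsilon_T$ is a sequence decreasing to $0$ at a rate
such that $T^{-1} \log(|\Theta_{\varepsilon_T}|) \to 0$, and $\delta_T$ is a sequence bounded by $\alpha$ and decreasing to $0$ with
$\delta_T = \omega\big(\varepsilon_T + \sqrt{T^{-1} \log(|\Theta_{\varepsilon_T}|)}\big)$, then the result of [Devroye and Lugosi, 2001, Yatracos, 1985],
combined with Markov's inequality, implies that to satisfy the condition of Lemma 7.5, it suffices
to take $R_{\mathcal{P}}(T,\alpha) = \delta_T^{-1}\big(3\varepsilon_T + \sqrt{8 T^{-1} \log(2|\Theta_{\varepsilon_T}|^2 \vee 8)}\big)$ and $\delta_{\mathcal{P}}(T,\alpha) = \delta_T$. For instance,
$\varepsilon_T = 2 \inf\{\varepsilon > 0 : \log(|\Theta_\varepsilon|) \le \sqrt{T}\}$ and $\delta_T = \alpha \wedge (\sqrt{\varepsilon_T} + T^{-1/8})$ suffice.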
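
To make the minimum distance skeleton estimate concrete, the following is a minimal sketch for a finite family of discrete distributions, which stands in for the $\varepsilon$-cover. The function names, the dictionary encoding of distributions, and the coin example are illustrative assumptions, not part of the source. For discrete distributions, the set $\{\omega : p_\theta(\omega) > p_{\theta'}(\omega)\}$ maximizes $p_\theta(A) - p_{\theta'}(A)$, so it is used as $A_{\theta,\theta'}$.

```python
from collections import Counter
from itertools import permutations

def yatracos_class(family):
    """Sets A_{a,b} = {w : p_a(w) > p_b(w)} for each ordered pair a != b.

    `family` maps a parameter name to a dict outcome -> probability.
    """
    outcomes = set().union(*(p.keys() for p in family.values()))
    return [
        frozenset(w for w in outcomes
                  if family[a].get(w, 0.0) > family[b].get(w, 0.0))
        for a, b in permutations(family, 2)
    ]

def min_distance_estimate(family, sample):
    """Return the member of `family` (at least two members) whose probabilities
    on the Yatracos class are closest to the empirical frequencies."""
    sets = yatracos_class(family)
    counts = Counter(sample)
    n = len(sample)

    def empirical(A):
        return sum(counts[w] for w in A) / n

    def model(p, A):
        return sum(p.get(w, 0.0) for w in A)

    def discrepancy(name):
        return max(abs(model(family[name], A) - empirical(A)) for A in sets)

    return min(family, key=discrepancy)

# Hypothetical two-member family: a fair coin and a biased coin.
coins = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.9, "T": 0.1}}
guess_biased = min_distance_estimate(coins, ["H"] * 9 + ["T"])
guess_fair = min_distance_estimate(coins, ["H"] * 5 + ["T"] * 5)
```

The estimate only compares probabilities on the finitely many sets of the Yatracos class, which is what allows a uniform deviation bound over a finite cover.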

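We are now ready for the proof of Theorem 7.1.

Proof of Theorem 7.1. For $\varepsilon > 0$, let $\Theta_\varepsilon \subseteq \Theta$ be a finite subset such that $\forall \theta \in \Theta$, $\exists \theta_\varepsilon \in \Theta_\varepsilon$ with
$\|\pi_{\theta_\varepsilon} - \pi_\theta\| < \varepsilon$; this exists by the assumption that $\{\pi_\theta : \theta \in \Theta\}$ is totally bounded. Then
Lemma 7.3 implies that $\forall \theta \in \Theta$, $\exists \theta_\varepsilon \in \Theta_\varepsilon$ with $\|\mathbb{P}_{Z_{td}(\theta_\varepsilon)} - \mathbb{P}_{Z_{td}(\theta)}\| \le \|\pi_{\theta_\varepsilon} - \pi_\theta\| < \varepsilon$,
so that $\{\mathbb{P}_{Z_{td}(\theta_\varepsilon)} : \theta_\varepsilon \in \Theta_\varepsilon\}$ is a finite $\varepsilon$-cover of $\{\mathbb{P}_{Z_{td}(\theta)} : \theta \in \Theta\}$. Therefore,
$\{\mathbb{P}_{Z_{td}(\theta)} : \theta \in \Theta\}$ is totally bounded. Lemma 7.5 then implies that there exists an estimator
$\hat\theta_{T\theta_\star} = \hat\theta_T(Z_{1d}(\theta_\star), \ldots, Z_{Td}(\theta_\star))$ and functions $R_d : \mathbb{N}_0 \times (0,1] \to [0,\infty)$ and $\delta_d : \mathbb{N}_0 \times (0,1] \to [0,1]$
such that $\forall \alpha > 0$, $\lim_{T\to\infty} R_d(T,\alpha) = \lim_{T\to\infty} \delta_d(T,\alpha) = 0$, and $\forall \theta_\star \in \Theta$ and $T \in \mathbb{N}_0$,

\[ \mathbb{P}\left( \left\| \mathbb{P}_{Z_{(T+1)d}(\hat\theta_{T\theta_\star}) | \hat\theta_{T\theta_\star}} - \mathbb{P}_{Z_{(T+1)d}(\theta_\star)} \right\| > R_d(T,\alpha) \right) \le \delta_d(T,\alpha) \le \alpha. \qquad (7.2) \]

Defining

\[ R(T,\alpha) = \min_{k \in \mathbb{N}} \left( r_k + 4 \cdot 2^{2k+d} k^d \sqrt{R_d(T,\alpha)} \right), \]

and $\delta(T,\alpha) = \delta_d(T,\alpha)$, and combining (7.2) with Lemmas 7.4 and 7.3, we have

\[ \mathbb{P}\left( \|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| > R(T,\alpha) \right) \le \delta(T,\alpha) \le \alpha. \]

Finally, note that $\lim_{k\to\infty} r_k = 0$ and $\lim_{T\to\infty} R_d(T,\alpha) = 0$ imply that $\lim_{T\to\infty} R(T,\alpha) = 0$. □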
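
The way the rate $R(T,\alpha)$ trades off the two terms can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the sequence `r` is a hypothetical stand-in for the $o(1)$ sequence $r_k$ of Lemma 7.3 (whose true form is not given here), and the minimum is truncated at `k_max`, which yields an upper bound on the minimum over all $k \in \mathbb{N}$.

```python
from math import sqrt

def rate_bound(R_d, d, r, k_max=50):
    """min over 1 <= k <= k_max of r(k) + 4 * 2**(2k+d) * k**d * sqrt(R_d)."""
    return min(r(k) + 4 * 2 ** (2 * k + d) * k ** d * sqrt(R_d)
               for k in range(1, k_max + 1))

r = lambda k: 1.0 / k                 # hypothetical o(1) sequence standing in for r_k
coarse = rate_bound(1e-6, d=2, r=r)   # larger R_d: small k is optimal
fine = rate_bound(1e-12, d=2, r=r)    # smaller R_d: a larger k becomes affordable
exact = rate_bound(0.0, d=2, r=r)     # R_d = 0: only the truncated r_k term remains
```

As `R_d` shrinks, the minimizing `k` grows and the bound decreases, which mirrors the final step of the proof: both $r_k \to 0$ and $R_d(T,\alpha) \to 0$ are needed for $R(T,\alpha) \to 0$.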


7.3.1  Identifiability  from  d  Points 

Inspection  of  the  above  proof  reveals  that  the  assumption  that  the  family  of  priors  is  totally 
bounded  is  required  only  to  establish  the  estimability  and  bounded  minimax  rate  guarantees.  In 
particular,  the  implied  identifiability  condition  is,  in  fact,  always  satisfied,  as  stated  formally  in 
the  following  corollary. 

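Corollary 7.6. For any priors $\pi_1, \pi_2$ on $\mathbb{C}$, if $h^*_i \sim \pi_i$, $X_1, \ldots, X_d$ are i.i.d. $\mathcal{D}$ independent from
$h^*_i$, and $Z_d(i) = \{(X_1, h^*_i(X_1)), \ldots, (X_d, h^*_i(X_d))\}$ for $i \in \{1,2\}$, then
$\mathbb{P}_{Z_d(1)} = \mathbb{P}_{Z_d(2)} \implies \pi_1 = \pi_2$.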


Proof. The described scenario is a special case of our general setting, with $\Theta = \{1,2\}$, in which
case $\mathbb{P}_{Z_d(i)} = \mathbb{P}_{Z_{1d}(i)}$. Thus, if $\mathbb{P}_{Z_d(1)} = \mathbb{P}_{Z_d(2)}$, then Lemma 7.4 and Lemma 7.3 combine to
imply that $\|\pi_1 - \pi_2\| \le \inf_{k \in \mathbb{N}} r_k = 0$. □

Since Corollary 7.6 is interesting in itself, it is worth noting that there is a simple direct proof
of this result. Specifically, by an inductive argument based on the observation (7.1) from the
proof of Lemma 7.4, we quickly find that for any $k \in \mathbb{N}$, $\mathbb{P}_{Z_{tk}(\theta_\star)}$ is identifiable from $\mathbb{P}_{Z_{td}(\theta_\star)}$.
Then we merely recall that $\mathbb{P}_{Z_t(\theta_\star)}$ is always identifiable from $\{\mathbb{P}_{Z_{tk}(\theta_\star)} : k \in \mathbb{N}\}$ [Kallenberg,
2002], and the argument from the proof of Lemma 7.2 shows $\pi_{\theta_\star}$ is identifiable from $\mathbb{P}_{Z_t(\theta_\star)}$.

It is natural to wonder whether identifiability of $\pi_{\theta_\star}$ from $\mathbb{P}_{Z_{tk}(\theta_\star)}$ remains true for some
smaller number of points $k < d$, so that we might hope to create an estimator for $\pi_{\theta_\star}$ based on
an estimator for $\mathbb{P}_{Z_{tk}(\theta_\star)}$. However, one can show that $d$ is actually the minimum possible value
for which this remains true for all $\mathcal{D}$ and all families of priors. Formally, we have the following
result, holding for every VC class $\mathbb{C}$.

Theorem 7.7. There exists a data distribution $\mathcal{D}$ and priors $\pi_1, \pi_2$ on $\mathbb{C}$ such that, for any
positive integer $k < d$, if $h^*_i \sim \pi_i$, $X_1, \ldots, X_k$ are i.i.d. $\mathcal{D}$ independent from $h^*_i$, and
$Z_k(i) = \{(X_1, h^*_i(X_1)), \ldots, (X_k, h^*_i(X_k))\}$ for $i \in \{1,2\}$, then $\mathbb{P}_{Z_k(1)} = \mathbb{P}_{Z_k(2)}$ but $\pi_1 \ne \pi_2$.


Proof. Note that it suffices to show this is the case for $k = d - 1$, since any smaller $k$ is a
marginal of this case. Consider a shatterable set of points $S_d = \{x_1, x_2, \ldots, x_d\} \subseteq \mathcal{X}$, and let
$\mathcal{D}$ be uniform on $S_d$. Let $\mathbb{C}[S_d]$ be any $2^d$ classifiers in $\mathbb{C}$ that shatter $S_d$. Let $\pi_1$ be the uniform
distribution on $\mathbb{C}[S_d]$. Now let $S_{d-1} = \{x_1, \ldots, x_{d-1}\}$ and $\mathbb{C}[S_{d-1}] \subseteq \mathbb{C}[S_d]$ shatter $S_{d-1}$ with
the property that $\forall h \in \mathbb{C}[S_{d-1}]$, $h(x_d) = \prod_{j=1}^{d-1} h(x_j)$. Let $\pi_2$ be uniform on $\mathbb{C}[S_{d-1}]$. Now
for any $k < d$ and distinct indices $t_1, \ldots, t_k \in \{1, \ldots, d\}$, $\{h^*_i(x_{t_1}), \ldots, h^*_i(x_{t_k})\}$ is distributed
uniformly in $\{-1,+1\}^k$ for both $i \in \{1,2\}$. This implies $\mathbb{P}_{Z_{d-1}(1)|X_1,\ldots,X_{d-1}} = \mathbb{P}_{Z_{d-1}(2)|X_1,\ldots,X_{d-1}}$,
which implies $\mathbb{P}_{Z_{d-1}(1)} = \mathbb{P}_{Z_{d-1}(2)}$. However, $\pi_1$ is clearly different from $\pi_2$, since even the sizes
of the supports are different. □
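
The construction in this proof is small enough to verify by enumeration. The sketch below is an illustration, with classifiers encoded as label vectors on the shattered set (an assumed encoding) and hypothetical helper names. It builds the supports of $\pi_1$ and $\pi_2$ and compares the label distributions on every tuple of $k$ point indices, repeats included, since the $X_i$ are drawn i.i.d. from the uniform distribution on $S_d$.

```python
from collections import Counter
from itertools import product
from math import prod

def priors(d):
    """Supports of pi_1 and pi_2, as label vectors on a shattered set of d points."""
    support_1 = list(product([-1, +1], repeat=d))                 # all 2^d labelings
    support_2 = [h for h in support_1 if h[-1] == prod(h[:-1])]   # h(x_d) = product of the others
    return support_1, support_2

def marginal(support, idx):
    """Distribution of (h(x_i))_{i in idx} for h uniform on `support`."""
    counts = Counter(tuple(h[i] for i in idx) for h in support)
    return {labels: c / len(support) for labels, c in counts.items()}

def agree_on_k_points(d, k):
    """True if both priors induce the same label distribution on every k-tuple of points."""
    s1, s2 = priors(d)
    return all(marginal(s1, idx) == marginal(s2, idx)
               for idx in product(range(d), repeat=k))

same_below_d = all(agree_on_k_points(3, k) for k in (1, 2))
same_at_d = agree_on_k_points(3, 3)
```

With $d = 3$, the two priors agree on all one-point and two-point label distributions but not on three points, matching the statement that fewer than $d$ points cannot identify the prior in general.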

7.4 Transfer Learning


In  this  section,  we  look  at  an  application  of  the  techniques  from  the  previous  section  to  transfer 
learning. Like the previous section, the results in this section are general, in that they are
applicable to a variety of learning protocols, including passive supervised learning, passive
semi-supervised learning, active learning, and learning with certain general types of data-dependent
interaction  (see  [Hanneke,  2009]).  For  simplicity,  we  restrict  our  discussion  to  the  active  learning 
formulation;  the  analogous  results  for  these  other  learning  protocols  follow  by  similar  reasoning. 

The result of the previous section implies that an estimator for $\theta_\star$ based on $d$-dimensional joint
distributions is consistent with a bounded rate of convergence $R$. Therefore, for certain
prior-dependent learning algorithms, their behavior should be similar under $\pi_{\hat\theta_{T\theta_\star}}$ to their behavior
under $\pi_{\theta_\star}$.

To make this concrete, we formalize this in the active learning protocol as follows. A
prior-dependent active learning algorithm $\mathcal{A}$ takes as inputs $\varepsilon > 0$, $\mathcal{D}$, and a distribution $\pi$ on $\mathbb{C}$. It
initially has access to $X_1, X_2, \ldots$ i.i.d. $\mathcal{D}$; it then selects an index $i_1$ to request the label for,
receives $Y_{i_1} = h^*(X_{i_1})$, then selects another index $i_2$, etc., until it eventually terminates and
returns a classifier. Denote by $Z = \{(X_1, h^*(X_1)), (X_2, h^*(X_2)), \ldots\}$. To be correct, $\mathcal{A}$ must
guarantee that for $h^* \sim \pi$, $\forall \varepsilon > 0$, $\mathbb{E}[\rho(\mathcal{A}(\varepsilon, \mathcal{D}, \pi), h^*)] \le \varepsilon$. We define the random variable
$N(\mathcal{A}, f, \varepsilon, \mathcal{D}, \pi)$ as the number of label requests $\mathcal{A}$ makes before terminating, when given $\varepsilon$, $\mathcal{D}$,
and $\pi$ as inputs, and when $h^* = f$ is the value of the target function; we make the particular
data sequence $Z$ the algorithm is run with implicit in this notation. We will be interested in the
expected sample complexity $SC(\mathcal{A}, \varepsilon, \mathcal{D}, \pi) = \mathbb{E}[N(\mathcal{A}, h^*, \varepsilon, \mathcal{D}, \pi)]$.

We propose the following algorithm $\mathcal{A}_\tau$ for transfer learning, defined in terms of a given
correct prior-dependent active learning algorithm $\mathcal{A}_a$. We discuss interesting specifications for
$\mathcal{A}_a$ in the next section, but for now the only assumption we require is that for any $\varepsilon > 0$ and
$\mathcal{D}$, there is a value $s_\varepsilon < \infty$ such that for every $\pi$ and $f \in \mathbb{C}$, $N(\mathcal{A}_a, f, \varepsilon, \mathcal{D}, \pi) \le s_\varepsilon$; this
is a very mild requirement, and any active learning algorithm can be converted into one that
satisfies this without significantly increasing its sample complexities for the priors it is already
good  for  [Balcan,  Hanneke,  and  Vaughan,  2010].  We  additionally  denote  by  me  =  —  In  (y), 
and $B(\theta, \gamma) = \{\theta' \in \Theta : \|\pi_\theta - \pi_{\theta'}\| \le \gamma\}$.

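Algorithm 1 $\mathcal{A}_\tau(T, \varepsilon)$: an algorithm for transfer learning, specified in terms of a generic
subroutine $\mathcal{A}_a$.

for $t = 1, 2, \ldots, T$ do
    Request labels $Y_{t1}(\theta_\star), \ldots, Y_{td}(\theta_\star)$
    if $R(t-1, \varepsilon/2) > \varepsilon/8$ then
        Request labels $Y_{t(d+1)}(\theta_\star), \ldots, Y_{tm_\varepsilon}(\theta_\star)$
        Take $\hat h_t$ as any $h \in \mathbb{C}$ s.t. $\forall i \le m_\varepsilon$, $h(X_{ti}) = Y_{ti}(\theta_\star)$
    else
        Let $\check\theta_{t\theta_\star} \in B\big(\hat\theta_{(t-1)\theta_\star}, R(t-1, \varepsilon/2)\big)$ be such that
            $SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_{\check\theta_{t\theta_\star}}) \le \min_{\theta \in B(\hat\theta_{(t-1)\theta_\star}, R(t-1, \varepsilon/2))} SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_\theta) + 1/t$
        Run $\mathcal{A}_a(\varepsilon/4, \mathcal{D}, \pi_{\check\theta_{t\theta_\star}})$ with data sequence $Z_t(\theta_\star)$ and let $\hat h_t$ be the classifier it returns
    end if
end for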

Recall that $\hat{\theta}_{(t-1)\theta_\star}$, which is defined by Theorem 7.1, is a function of the labels requested
on previous rounds of the algorithm; $R(t-1, \varepsilon/2)$ is also defined by Theorem 7.1, and has no
dependence on the data (or on $\theta_\star$). The other quantities referred to in Algorithm 1 are defined
just prior to Algorithm 1. We suppose the algorithm has access to the value $SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_\theta)$
for every $\theta \in \Theta$. This can sometimes be calculated analytically as a function of $\theta$, or else can
typically  be  approximated  via  Monte  Carlo  simulations.  In  fact,  the  result  below  holds  even  if 
SC  is  merely  an  accessible  upper  bound  on  the  expected  sample  complexity. 
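A Monte Carlo approximation of an expected sample complexity can be sketched as follows. The setup is a toy assumption, not taken from the thesis: targets are thresholds over a pool of `n` ordered points, and a prior-independent binary search stands in for $\mathcal{A}_a$. The names `binary_search_labels` and `monte_carlo_sc` are hypothetical.

```python
import random
from typing import Callable


def binary_search_labels(target: int, n: int) -> int:
    """Label requests used by binary search to locate a threshold in {0, ..., n}.

    Pool point i (0 <= i < n) is labeled +1 exactly when i >= target.
    """
    lo, hi, used = 0, n, 0
    while lo < hi:
        mid = (lo + hi) // 2
        used += 1                  # one label request at pool point `mid`
        if mid < target:           # label -1: the threshold lies to the right
            lo = mid + 1
        else:                      # label +1: the threshold is at or left of mid
            hi = mid
    return used


def monte_carlo_sc(
    sample_target: Callable[[random.Random], int],
    count_labels: Callable[[int], int],
    trials: int,
    rng: random.Random,
) -> float:
    """Average label count over targets drawn from the prior."""
    total = sum(count_labels(sample_target(rng)) for _ in range(trials))
    return total / trials
```

For a prior-dependent subroutine, `count_labels` would also take the prior as an argument; the averaging step is unchanged.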

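Theorem 7.8. The algorithm $\mathcal{A}_\tau$ is correct. Furthermore, if $S_T(\varepsilon)$ is the total number of label
requests made by $\mathcal{A}_\tau(T, \varepsilon)$, then

$$\limsup_{T \to \infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \le SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_{\theta_\star}) + d.$$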

The implication of Theorem 7.8 is that, via transfer learning, it is possible to achieve almost
the same long-run average sample complexity as would be achievable if the target’s prior
distribution were known to the learner. We will see in the next section that this is sometimes
significantly better than the single-task sample complexity. As mentioned, results of this type for
transfer learning have previously appeared when $\mathcal{A}_a$ is a passive learning method [Baxter, 1997];
however,  to  our  knowledge,  this  is  the  first  such  result  where  the  asymptotics  concern  only  the 
number  of  learning  tasks,  not  the  number  of  samples  per  task;  this  is  also  the  first  result  we  know 
of  that  is  immediately  applicable  to  more  sophisticated  learning  protocols  such  as  active  learning. 

The algorithm $\mathcal{A}_\tau$ is stated in a simple way here, but Theorem 7.8 can be improved with
some obvious modifications to $\mathcal{A}_\tau$. The extra “$+d$” in Theorem 7.8 is not actually necessary,
since we could stop updating the estimator $\hat{\theta}_{t\theta_\star}$ (and the corresponding $R$ value) after some $o(T)$
number of rounds (e.g., $\sqrt{T}$), in which case we would not need to request $Y_{t1}(\theta_\star), \ldots, Y_{td}(\theta_\star)$
for $t$ larger than this, and the extra $d \cdot o(T)$ number of labeled examples vanishes in the average
as $T \to \infty$. Additionally, the $\varepsilon/4$ term can easily be improved to any value arbitrarily close to $\varepsilon$
(even $(1 - o(1))\varepsilon$) by running $\mathcal{A}_a$ with argument $\varepsilon - 2R(t-1, \varepsilon/2) - \delta(t-1, \varepsilon/2)$ instead of
$\varepsilon/4$, and using this value in the $SC$ calculations in the definition of $\check{\theta}_{t\theta_\star}$ as well. In fact, for many
algorithms $\mathcal{A}_a$ (e.g., with $SC(\mathcal{A}_a, \varepsilon, \mathcal{D}, \pi_{\theta_\star})$ continuous in $\varepsilon$), combining the above two tricks
yields $\limsup_{T \to \infty} \mathbb{E}[S_T(\varepsilon)]/T \le SC(\mathcal{A}_a, \varepsilon, \mathcal{D}, \pi_{\theta_\star})$.

Returning to our motivational remarks from Subsection 7.2.1, we can ask how many extra labeled
examples are required from each learning problem to gain the benefits of transfer learning.
This question essentially concerns the initial step of requesting the labels $Y_{t1}(\theta_\star), \ldots, Y_{td}(\theta_\star)$.
Clearly this indicates that from each learning problem, we need at most $d$ extra labeled examples
to gain the benefits of transfer. Whether these $d$ label requests are indeed extra depends on the
particular learning algorithm $\mathcal{A}_a$; that is, in some cases (e.g., certain passive learning algorithms),
$\mathcal{A}_a$ may itself use these initial $d$ labels for learning, so that in these cases the benefits of transfer
learning are essentially gained as a by-product of the learning processes, and essentially no
additional labeling effort need be expended to gain these benefits. On the other hand, for some
active learning algorithms, we may expect that at least some of these initial $d$ labels would not
be requested by the algorithm, so that some extra labeling effort is expended to gain the benefits
of transfer in these cases.

One drawback of our approach is that we require the data distribution $\mathcal{D}$ to remain fixed
across tasks (this contrasts with [Baxter, 1997]). However, it should be possible to relax this
requirement in the active learning setting in many cases. For instance, if $\mathcal{X} = \mathbb{R}^k$, then as long
as we are guaranteed that the distribution $\mathcal{D}_t$ for each learning task has a strictly positive density
function, it should be possible to use rejection sampling for each task to guarantee the $d$ queried
examples from each task have approximately the same distribution across tasks. This is all we
require for our consistency results on $\hat{\theta}_{T\theta_\star}$ (i.e., it was not important that the $d$ samples came
from the true distribution $\mathcal{D}$, only that they came from a distribution under which $\rho$ is a metric).
We leave the details of such an adaptive method for future consideration.
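The rejection-sampling idea can be illustrated with a minimal sketch. It assumes, beyond what the text states, that each task's density can be evaluated and that a finite bound on the ratio of the common density to the task density is known; the function names and the example density $0.5 + x$ on $[0, 1]$ are hypothetical.

```python
import math
import random
from typing import Callable, List


def rejection_resample(
    draw: Callable[[random.Random], float],
    task_density: Callable[[float], float],
    common_density: Callable[[float], float],
    ratio_bound: float,
    n: int,
    rng: random.Random,
) -> List[float]:
    """Keep draws from task_density so that the kept points follow common_density.

    Requires common_density(x) <= ratio_bound * task_density(x) for all x, which is
    possible only when task_density is strictly positive wherever common_density is.
    """
    kept: List[float] = []
    while len(kept) < n:
        x = draw(rng)
        # Accept x with probability common_density(x) / (ratio_bound * task_density(x)).
        if rng.random() * ratio_bound * task_density(x) <= common_density(x):
            kept.append(x)
    return kept


def draw_task_point(rng: random.Random) -> float:
    """Inverse-CDF sampler for the hypothetical task density 0.5 + x on [0, 1]."""
    u = rng.random()
    return (-1.0 + math.sqrt(1.0 + 8.0 * u)) / 2.0
```

Here the accepted points are uniform on $[0, 1]$ regardless of the task density, so the $d$ queried examples would share one distribution across tasks, at the cost of discarding a fraction of the unlabeled pool.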

7.4.1  Proof  of  Theorem  7.8 

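Recall that, to establish correctness, we must show that $\forall t \le T$, $\mathbb{E}[\rho(\hat{h}_t, h^*_{t\theta_\star})] \le \varepsilon$, regardless
of the value of $\theta_\star \in \Theta$. Fix any $\theta_\star \in \Theta$ and $t \le T$. If $R(t-1, \varepsilon/2) > \varepsilon/8$, then classic
results from passive learning indicate that $\mathbb{E}[\rho(\hat{h}_t, h^*_{t\theta_\star})] \le \varepsilon$ [Vapnik, 1982]. Otherwise, by
Theorem 7.1, with probability at least $1 - \varepsilon/2$, we have $\|\pi_{\theta_\star} - \pi_{\hat{\theta}_{(t-1)\theta_\star}}\| \le R(t-1, \varepsilon/2)$. On this
event, if $R(t-1, \varepsilon/2) \le \varepsilon/8$, then by a triangle inequality $\|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \le 2R(t-1, \varepsilon/2) \le \varepsilon/4$.
Thus,

$$\mathbb{E}\left[\rho(\hat{h}_t, h^*_{t\theta_\star})\right] \le \mathbb{E}\left[\mathbb{E}\left[\rho(\hat{h}_t, h^*_{t\theta_\star}) \,\middle|\, \check{\theta}_{t\theta_\star}\right] \mathbb{1}\left[\|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \le \varepsilon/4\right]\right] + \varepsilon/2. \tag{7.3}$$

For $\theta \in \Theta$, let $\hat{h}_{t\theta}$ denote the classifier that would be returned by $\mathcal{A}_a(\varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}})$ when
run with data sequence $\{(X_{t1}, h^*_{t\theta}(X_{t1})), (X_{t2}, h^*_{t\theta}(X_{t2})), \ldots\}$. Note that for any $\theta \in \Theta$, any
measurable function $F : \mathbb{C} \to [0, 1]$ has

$$\mathbb{E}\left[F(h^*_{t\theta_\star})\right] \le \mathbb{E}\left[F(h^*_{t\theta})\right] + \|\pi_\theta - \pi_{\theta_\star}\|. \tag{7.4}$$

In particular, supposing $\|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \le \varepsilon/4$, we have

$$\mathbb{E}\left[\rho(\hat{h}_t, h^*_{t\theta_\star}) \,\middle|\, \check{\theta}_{t\theta_\star}\right] = \mathbb{E}\left[\rho(\hat{h}_{t\theta_\star}, h^*_{t\theta_\star}) \,\middle|\, \check{\theta}_{t\theta_\star}\right]$$
$$\le \mathbb{E}\left[\rho(\hat{h}_{t\check{\theta}_{t\theta_\star}}, h^*_{t\check{\theta}_{t\theta_\star}}) \,\middle|\, \check{\theta}_{t\theta_\star}\right] + \|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \le \varepsilon/4 + \varepsilon/4 = \varepsilon/2.$$

Combined with (7.3), this implies $\mathbb{E}[\rho(\hat{h}_t, h^*_{t\theta_\star})] \le \varepsilon$.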


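We establish the sample complexity claim as follows. First note that convergence of $R(t-1, \varepsilon/2)$
implies that $\lim_{T \to \infty} \sum_{t=1}^{T} \mathbb{1}[R(t, \varepsilon/2) > \varepsilon/8]/T = 0$, and that the number of labels
used for a value of $t$ with $R(t-1, \varepsilon/2) > \varepsilon/8$ is bounded by a finite function $m_\varepsilon$ of $\varepsilon$. Therefore,

$$\limsup_{T \to \infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \le d + \limsup_{T \to \infty} \sum_{t=1}^{T} \mathbb{E}\left[N(\mathcal{A}_a, h^*_{t\theta_\star}, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}})\right] \mathbb{1}\left[R(t-1, \varepsilon/2) \le \varepsilon/8\right]/T$$
$$\le d + \limsup_{T \to \infty} \sum_{t=1}^{T} \mathbb{E}\left[N(\mathcal{A}_a, h^*_{t\theta_\star}, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}})\right]/T. \tag{7.5}$$

By the definition of $R$, $\delta$ from Theorem 7.1, we have

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[N(\mathcal{A}_a, h^*_{t\theta_\star}, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}}) \mathbb{1}\left[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| > R(t-1, \varepsilon/2)\right]\right]$$
$$\le \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} s_{\varepsilon/4} \, \mathbb{P}\left(\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| > R(t-1, \varepsilon/2)\right)$$
$$\le s_{\varepsilon/4} \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \delta(t-1, \varepsilon/2) = 0.$$

Combined with (7.5), this implies

$$\limsup_{T \to \infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \le d + \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[N(\mathcal{A}_a, h^*_{t\theta_\star}, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}}) \mathbb{1}\left[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1, \varepsilon/2)\right]\right].$$

For any $t \le T$, on the event $\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1, \varepsilon/2)$, we have (by the property (7.4)
and a triangle inequality)

$$\mathbb{E}\left[N(\mathcal{A}_a, h^*_{t\theta_\star}, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}}) \,\middle|\, \check{\theta}_{t\theta_\star}\right]$$
$$\le \mathbb{E}\left[N(\mathcal{A}_a, h^*_{t\check{\theta}_{t\theta_\star}}, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}}) \,\middle|\, \check{\theta}_{t\theta_\star}\right] + 2R(t-1, \varepsilon/2)$$
$$= SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_{\check{\theta}_{t\theta_\star}}) + 2R(t-1, \varepsilon/2)$$
$$\le SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_{\theta_\star}) + 1/t + 2R(t-1, \varepsilon/2),$$

where the last inequality follows by definition of $\check{\theta}_{t\theta_\star}$. Therefore,

$$\limsup_{T \to \infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \le d + \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left(SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_{\theta_\star}) + 1/t + 2R(t-1, \varepsilon/2)\right)$$
$$= d + SC(\mathcal{A}_a, \varepsilon/4, \mathcal{D}, \pi_{\theta_\star}).$$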


□ 


7.5  Conclusions 

We  have  shown  that  when  learning  a  sequence  of  i.i.d.  target  concepts  from  a  known  VC  class, 
with  an  unknown  distribution  from  a  known  totally  bounded  family,  transfer  learning  can  lead 
to  amortized  average  sample  complexity  close  to  that  achievable  by  an  algorithm  with  direct 
knowledge of the targets’ distribution.
Chapter 8


Prior  Estimation 


Abstract 

¹We study the optimal rates of convergence for estimating a prior distribution over a VC class
from a sequence of independent data sets respectively labeled by independent target functions
sampled from the prior. We specifically derive upper and lower bounds on the optimal rates
under a smoothness condition on the correct prior, with the number of samples per data set equal
to the VC dimension. These results have implications for the improvements achievable via transfer
learning.


8.1  Introduction 

In the transfer learning setting, we are presented with a sequence of learning problems, each
with  some  respective  target  concept  we  are  tasked  with  learning.  The  key  question  in  transfer 
learning  is  how  to  leverage  our  access  to  past  learning  problems  in  order  to  improve  performance 
on  learning  problems  we  will  be  presented  with  in  the  future. 

¹Joint work with Jaime Carbonell and Steve Hanneke.

Among the several proposed models for transfer learning, one particularly appealing model
supposes the learning problems are independent and identically distributed, with unknown distribution,
and the advantage of transfer learning then comes from the ability to estimate this shared
distribution  based  on  the  data  from  past  learning  problems  [Baxter,  1997,  Yang,  Hanneke,  and 
Carbonell, 2011]. For instance, when customizing a speech recognition system to a particular
speaker’s voice, we might expect the first few people would need to speak many words or
phrases  in  order  for  the  system  to  accurately  identify  the  nuances.  However,  after  performing 
this  for  many  different  people,  if  the  software  has  access  to  those  past  training  sessions  when 
customizing  itself  to  a  new  user,  it  should  have  identified  important  properties  of  the  speech 
patterns,  such  as  the  common  patterns  within  each  of  the  major  dialects  or  accents,  and  other 
such  information  about  the  distribution  of  speech  patterns  within  the  user  population.  It  should 
then  be  able  to  leverage  this  information  to  reduce  the  number  of  words  or  phrases  the  next  user 
needs  to  speak  in  order  to  train  the  system,  for  instance  by  first  trying  to  identify  the  individual’s 
dialect,  then  presenting  phrases  that  differentiate  common  subpatterns  within  that  dialect,  and  so 
forth. 

In  analyzing  the  benefits  of  transfer  learning  in  such  a  setting,  one  important  question  to  ask 
is  how  quickly  we  can  estimate  the  distribution  from  which  the  learning  problems  are  sampled. 
In recent work, Yang, Hanneke, and Carbonell [2011] have shown that under mild conditions on
the family of possible distributions, if the target concepts reside in a known VC class, then it is
possible to estimate this distribution using only a bounded number of training samples per task:
specifically, a number of samples equal to the VC dimension. However, we left open the question
of  quantifying  the  rate  of  convergence.  This  rate  of  convergence  can  have  a  direct  impact  on  how 
much  benefit  we  gain  from  transfer  learning  when  we  are  faced  with  only  a  finite  sequence  of 
learning  problems.  As  such,  it  is  certainly  desirable  to  derive  tight  characterizations  of  this  rate 
of  convergence. 

The  present  work  continues  that  of  Yang,  Hanneke,  and  Carbonell  [2011],  bounding  the  rate 
of  convergence  for  estimating  this  distribution,  under  a  smoothness  condition  on  the  distribution. 
We derive a generic upper bound, which holds regardless of the VC class the target concepts
reside  in.  The  proof  of  this  result  builds  on  our  earlier  work,  but  requires  several  interesting 
innovations  to  make  the  rate  of  convergence  explicit,  and  to  dramatically  improve  the  upper 
bound  implicit  in  the  proofs  of  those  earlier  results.  We  further  derive  a  nontrivial  lower  bound 
that  holds  for  certain  constructed  scenarios,  which  illustrates  a  lower  limit  on  how  good  of  a 
general  upper  bound  we  might  hope  for  in  results  expressed  only  in  terms  of  the  number  of  tasks, 
the  smoothness  conditions,  and  the  VC  dimension. 

8.2  The  Setting 

Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a Borel space [Schervish, 1995] (where $\mathcal{X}$ is called the instance space), and
let $\mathcal{D}$ be a distribution on $\mathcal{X}$ (called the data distribution). Let $\mathbb{C}$ be a VC class of measurable
classifiers $h : \mathcal{X} \to \{-1, +1\}$ (called the concept space), and denote by $d$ the VC dimension of
$\mathbb{C}$ [Vapnik, 1982]. We suppose $\mathbb{C}$ is equipped with its Borel $\sigma$-algebra $\mathcal{B}$ induced by the
pseudo-metric $\rho(h, g) = \mathcal{D}(\{x \in \mathcal{X} : h(x) \ne g(x)\})$. Though our results can be formulated for general
$\mathcal{D}$ (with somewhat more complicated theorem statements), to simplify the statement of results
we suppose $\rho$ is actually a metric, which would follow from appropriate topological conditions
on $\mathbb{C}$ relative to $\mathcal{D}$. For any two probability measures $\mu_1, \mu_2$ on a measurable space $(\Omega, \mathcal{F})$, define
the total variation distance

$$\|\mu_1 - \mu_2\| = \sup_{A \in \mathcal{F}} \mu_1(A) - \mu_2(A).$$
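On a finite space the supremum in this definition can be computed directly, and it coincides with half the $L_1$ distance between the probability mass functions, the form used repeatedly in the proofs below. The following sketch checks this on a small example; representing measures as dictionaries and the function names are illustrative choices, not notation from the text.

```python
from itertools import chain, combinations
from typing import Dict, Hashable


def tv_half_l1(mu1: Dict[Hashable, float], mu2: Dict[Hashable, float]) -> float:
    """Total variation distance as half the L1 distance between mass functions."""
    support = set(mu1) | set(mu2)
    return 0.5 * sum(abs(mu1.get(x, 0.0) - mu2.get(x, 0.0)) for x in support)


def tv_by_sup(mu1: Dict[Hashable, float], mu2: Dict[Hashable, float]) -> float:
    """Total variation distance as the supremum of mu1(A) - mu2(A) over all events A."""
    support = sorted(set(mu1) | set(mu2))
    events = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(
        sum(mu1.get(x, 0.0) - mu2.get(x, 0.0) for x in event) for event in events
    )
```

The brute-force supremum enumerates every subset, so it is only feasible for very small supports; the half-$L_1$ form is the practical one.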

Let $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta\}$ be a family of probability measures on $\mathbb{C}$ (called priors), where $\Theta$
is an arbitrary index set (called the parameter space). We additionally suppose there exists a
probability measure $\pi_0$ on $\mathbb{C}$ (called the reference measure) such that every $\pi_\theta$ is absolutely
continuous with respect to $\pi_0$, and therefore has a density function $f_\theta$ given by the Radon-Nikodym
derivative $\frac{d\pi_\theta}{d\pi_0}$ [Schervish, 1995].

We consider the following type of estimation problem. There is a collection of $\mathbb{C}$-valued random
variables $\{h^*_{t\theta} : t \in \mathbb{N}, \theta \in \Theta\}$, where for any fixed $\theta \in \Theta$ the $\{h^*_{t\theta}\}_{t=1}^{\infty}$ variables are i.i.d.
with distribution $\pi_\theta$. For each $\theta \in \Theta$, there is a sequence $Z_t(\theta) = \{(X_{t1}, Y_{t1}(\theta)), (X_{t2}, Y_{t2}(\theta)), \ldots\}$,
where $\{X_{ti}\}_{t,i \in \mathbb{N}}$ are i.i.d. $\mathcal{D}$, and for each $t, i \in \mathbb{N}$, $Y_{ti}(\theta) = h^*_{t\theta}(X_{ti})$. We additionally denote
by $\mathbb{Z}_{tk}(\theta) = \{(X_{t1}, Y_{t1}(\theta)), \ldots, (X_{tk}, Y_{tk}(\theta))\}$ the first $k$ elements of $Z_t(\theta)$, for any $k \in \mathbb{N}$, and
similarly $\mathbb{X}_{tk} = \{X_{t1}, \ldots, X_{tk}\}$ and $\mathbb{Y}_{tk}(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$. Following the terminology
used in the transfer learning literature, we refer to the collection of variables associated
with each $t$ collectively as the $t$-th task. We will be concerned with sequences of estimators
$\hat{\theta}_{T\theta} = \hat{\theta}_T(\mathbb{Z}_{1k}(\theta), \ldots, \mathbb{Z}_{Tk}(\theta))$, for $T \in \mathbb{N}$, which are based on only a bounded number $k$ of
samples per task, among the first $T$ tasks. Our main results specifically study the case of $k = d$.
For any such estimator, we measure the risk as $\mathbb{E}\left[\|\pi_{\hat{\theta}_{T\theta_\star}} - \pi_{\theta_\star}\|\right]$, and will be particularly
interested in upper-bounding the worst-case risk $\sup_{\theta_\star \in \Theta} \mathbb{E}\left[\|\pi_{\hat{\theta}_{T\theta_\star}} - \pi_{\theta_\star}\|\right]$ as a function of $T$, and
lower-bounding the minimum possible value of this worst-case risk over all possible $\hat{\theta}_T$ estimators
(called the minimax risk).

In previous work [Yang, Hanneke, and Carbonell, 2011], we showed that, if $\Pi_\Theta$ is a totally
bounded family, then even with only $d$ samples per task, the minimax risk (as a function
of the number of tasks $T$) converges to zero. In fact, we also proved this is not necessarily
the case in general for any number of samples less than $d$. However, the actual rates of convergence
were not explicitly derived in that work, and indeed the upper bounds on the rates of
convergence implicit in that analysis may often have fairly complicated dependences on $\mathbb{C}$, $\Pi_\Theta$,
and $\mathcal{D}$, and furthermore often provide only very slow rates of convergence.

To  derive  explicit  bounds  on  the  rates  of  convergence,  in  the  present  work  we  specifically 
focus  on  families  of  smooth  densities.  The  motivation  for  involving  a  notion  of  smoothness  in 
characterizing rates of convergence is clear if we consider the extreme case in which $\Pi_\Theta$ contains
two priors $\pi_1$ and $\pi_2$, with $\pi_1(\{h\}) = \pi_2(\{g\}) = 1$, where $\rho(h, g)$ is a very small but nonzero
value; in this case, if we have only a small number of samples per task, we would require many
tasks (on the order of $1/\rho(h, g)$) to observe any data points carrying any information that would
distinguish between these two priors (namely, points $x$ with $h(x) \ne g(x)$); yet $\|\pi_1 - \pi_2\| = 1$,
so that we have a slow rate of convergence (at least initially). A total boundedness condition
on $\Pi_\Theta$ would limit the number of such pairs present in $\Pi_\Theta$, so that for instance we cannot have
arbitrarily  close  h  and  g,  but  less  extreme  variants  of  this  can  lead  to  slow  asymptotic  rates  of 
convergence  as  well. 

Specifically, in the present work we consider the following notion of smoothness. For $L \in (0, \infty)$
and $\alpha \in (0, 1]$, a function $f : \mathbb{C} \to \mathbb{R}$ is $(L, \alpha)$-Hölder smooth if

$$\forall h, g \in \mathbb{C}, \quad |f(h) - f(g)| \le L \rho(h, g)^{\alpha}.$$


8.3  An  Upper  Bound 


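We now have the following theorem, holding for an arbitrary VC class $\mathbb{C}$ and data distribution
$\mathcal{D}$; it is the main result of this work.

Theorem 8.1. For $\Pi_\Theta$ any class of priors on $\mathbb{C}$ having $(L, \alpha)$-Hölder smooth densities $\{f_\theta : \theta \in \Theta\}$,
for any $T \in \mathbb{N}$, there exists an estimator $\hat{\theta}_{T\theta} = \hat{\theta}_T(\mathbb{Z}_{1d}(\theta), \ldots, \mathbb{Z}_{Td}(\theta))$ such that

$$\sup_{\theta_\star \in \Theta} \mathbb{E}\|\pi_{\hat{\theta}_T} - \pi_{\theta_\star}\| = \tilde{O}\left(L T^{-\frac{\alpha^2}{2(d+2\alpha)(\alpha+2(d+1))}}\right).$$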

Proof. By the standard PAC analysis [Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989, Vapnik,
1982], for any $\gamma > 0$, with probability greater than $1 - \gamma$, a sample of $k = O((d/\gamma) \log(1/\gamma))$
random points will partition $\mathbb{C}$ into regions of width less than $\gamma$. For brevity, we omit the $t$
subscript on quantities such as $\mathbb{Z}_{tk}(\theta)$ throughout the following analysis, since the claims hold for
any arbitrary value of $t$.

For any $\theta \in \Theta$, let $\pi'_\theta$ denote a (conditional on $X_1, \ldots, X_k$) distribution defined as follows.
Let $f'_\theta$ denote the (conditional on $X_1, \ldots, X_k$) density function of $\pi'_\theta$ with respect to $\pi_0$, and for
any $g \in \mathbb{C}$, let

$$f'_\theta(g) = \frac{\pi_\theta(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = g(X_i)\})}{\pi_0(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = g(X_i)\})}$$

(or 0 if $\pi_0(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = g(X_i)\}) = 0$). In other words, $\pi'_\theta$ has the same probability
mass as $\pi_\theta$ for each of the equivalence classes induced by $X_1, \ldots, X_k$, but conditioned on the
equivalence class, simply has a constant-density distribution over that equivalence class. Note that,
by the smoothness condition, with probability greater than $1 - \gamma$, we have everywhere

$$|f_\theta(h) - f'_\theta(h)| < L\gamma^{\alpha}.$$

So for any $\theta, \theta' \in \Theta$, with probability greater than $1 - \gamma$,

$$\|\pi_\theta - \pi_{\theta'}\| = (1/2) \int |f_\theta - f_{\theta'}| \, d\pi_0 < L\gamma^{\alpha} + (1/2) \int |f'_\theta - f'_{\theta'}| \, d\pi_0.$$

Furthermore, since the regions that define $f'_\theta$ and $f'_{\theta'}$ are the same (namely, the partition induced
by $X_1, \ldots, X_k$), we have

$$(1/2) \int |f'_\theta - f'_{\theta'}| \, d\pi_0 = (1/2) \sum_{y_1, \ldots, y_k \in \{-1,+1\}} \left|\pi_\theta(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = y_i\}) - \pi_{\theta'}(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = y_i\})\right|$$
$$= \|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\|.$$

Thus, we have that with probability at least $1 - \gamma$,

$$\|\pi_\theta - \pi_{\theta'}\| < L\gamma^{\alpha} + \|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\|.$$

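Following an argument analogous to the inductive argument of Yang, Hanneke, and Carbonell [2011],
suppose $I \subseteq \{1, \ldots, k\}$, fix $\bar{x}_I \in \mathcal{X}^{|I|}$ and $\bar{y}_I \in \{-1, +1\}^{|I|}$. Then the $\tilde{y}_I \in \{-1, +1\}^{|I|}$ for
which no $h \in \mathbb{C}$ has $h(\bar{x}_I) = \tilde{y}_I$ for which $\|\bar{y}_I - \tilde{y}_I\|_1$ is minimal, has $\|\bar{y}_I - \tilde{y}_I\|_1 \le d + 1$, and
for any $i \in I$ with $\bar{y}_i \ne \tilde{y}_i$, letting $\bar{y}'_j = \bar{y}_j$ for $j \in I \setminus \{i\}$ and $\bar{y}'_i = \tilde{y}_i$, we have

$$\mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}_I | \bar{x}_I) = \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta)|\mathbb{X}_{I \setminus \{i\}}}(\bar{y}_{I \setminus \{i\}} | \bar{x}_{I \setminus \{i\}}) - \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}'_I | \bar{x}_I),$$

and similarly for $\theta'$, so that

$$\left|\mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}_I | \bar{x}_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar{y}_I | \bar{x}_I)\right|$$
$$\le \left|\mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta)|\mathbb{X}_{I \setminus \{i\}}}(\bar{y}_{I \setminus \{i\}} | \bar{x}_{I \setminus \{i\}}) - \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta')|\mathbb{X}_{I \setminus \{i\}}}(\bar{y}_{I \setminus \{i\}} | \bar{x}_{I \setminus \{i\}})\right|$$
$$+ \left|\mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}'_I | \bar{x}_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar{y}'_I | \bar{x}_I)\right|.$$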
Now consider that these two terms inductively define a binary tree. Every time the tree branches
left once, it arrives at a difference of probabilities for a set $I$ of one less element than that of its
parent. Every time the tree branches right once, it arrives at a difference of probabilities for a
$\bar{y}_I$ one closer to an unrealized $\tilde{y}_I$ than that of its parent. Say we stop branching the tree upon
reaching a set $I$ and a $\bar{y}_I$ such that either $\bar{y}_I$ is an unrealized labeling, or $|I| = d$. Thus, we
can bound the original (root node) difference of probabilities by the sum of the differences of
probabilities for the leaf nodes with $|I| = d$. Any path in the tree can branch left at most $k - d$
times (total) before reaching a set $I$ with only $d$ elements, and can branch right at most $d + 1$
times in a row before reaching a $\bar{y}_I$ such that both probabilities are zero, so that the difference is
zero. So the depth of any leaf node with $|I| = d$ is at most $(k - d)d$. Furthermore, at any level
of the tree, from left to right the nodes have strictly decreasing $|I|$ values, so that the maximum
width of the tree is at most $k - d$. So the total number of leaf nodes with $|I| = d$ is at most
$(k - d)^2 d$. Thus, for any $\bar{y} \in \{-1, +1\}^k$ and $\bar{x} \in \mathcal{X}^k$,

$$\left|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(\bar{y} | \bar{x}) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(\bar{y} | \bar{x})\right|$$
$$\le (k - d)^2 d \cdot \max_{\bar{y}^d \in \{-1,+1\}^d} \max_{D \in \{1, \ldots, k\}^d} \left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\bar{y}^d | \bar{x}_D) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\bar{y}^d | \bar{x}_D)\right|.$$

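Since

$$\|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\| = (1/2) \sum_{y^k \in \{-1,+1\}^k} \left|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(y^k) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(y^k)\right|,$$

and by Sauer’s Lemma this is at most

$$(ek)^d \max_{y^k \in \{-1,+1\}^k} \left|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(y^k) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(y^k)\right|,$$

we have that

$$\|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\| \le (ek)^d k^2 d \max_{y^d \in \{-1,+1\}^d} \max_{D \in \{1, \ldots, k\}^d} \left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_D}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_D}(y^d)\right|.$$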
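Thus, we have that

$$\|\pi_\theta - \pi_{\theta'}\| = \mathbb{E}\|\pi_\theta - \pi_{\theta'}\| < \gamma + L\gamma^{\alpha} + (ek)^d k^2 d \, \mathbb{E}\left[\max_{y^d \in \{-1,+1\}^d} \max_{D \in \{1, \ldots, k\}^d} \left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_D}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_D}(y^d)\right|\right].$$

Note that

$$\mathbb{E}\left[\max_{y^d \in \{-1,+1\}^d} \max_{D \in \{1, \ldots, k\}^d} \left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_D}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_D}(y^d)\right|\right]$$
$$\le \sum_{y^d \in \{-1,+1\}^d} \sum_{D \in \{1, \ldots, k\}^d} \mathbb{E}\left[\left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_D}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_D}(y^d)\right|\right]$$
$$\le (2k)^d \max_{y^d \in \{-1,+1\}^d} \max_{D \in \{1, \ldots, k\}^d} \mathbb{E}\left[\left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_D}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_D}(y^d)\right|\right],$$

and by exchangeability, this last line equals

$$(2k)^d \max_{y^d \in \{-1,+1\}^d} \mathbb{E}\left[\left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(y^d)\right|\right].$$

Yang, Hanneke, and Carbonell [2011] showed that

$$\mathbb{E}\left[\left|\mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(y^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(y^d)\right|\right] \le 4\sqrt{\|\mathbb{P}_{\mathbb{Z}_d(\theta)} - \mathbb{P}_{\mathbb{Z}_d(\theta')}\|},$$

so that in total we have

$$\|\pi_\theta - \pi_{\theta'}\| < (L + 1)\gamma^{\alpha} + 4(2ek)^{2d+2} \sqrt{\|\mathbb{P}_{\mathbb{Z}_d(\theta)} - \mathbb{P}_{\mathbb{Z}_d(\theta')}\|}.$$

Plugging in the value of $k = c(d/\gamma) \log(1/\gamma)$, this is

$$(L + 1)\gamma^{\alpha} + 4\left(2ec\frac{d}{\gamma} \log\left(\frac{1}{\gamma}\right)\right)^{2d+2} \sqrt{\|\mathbb{P}_{\mathbb{Z}_d(\theta)} - \mathbb{P}_{\mathbb{Z}_d(\theta')}\|}.$$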

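So the only remaining question is the rate of convergence of our estimate of $\mathbb{P}_{\mathbb{Z}_d(\theta_\star)}$. If $N(\varepsilon)$
is the $\varepsilon$-covering number of $\{\mathbb{P}_{\mathbb{Z}_d(\theta)} : \theta \in \Theta\}$, then taking $\hat{\theta}_{T\theta_\star}$ as the minimum distance skeleton
estimate of Devroye and Lugosi [2001], Yatracos [1985] achieves expected total variation
distance $\varepsilon$ from $\mathbb{P}_{\mathbb{Z}_d(\theta_\star)}$, for some $T = O((1/\varepsilon^2) \log N(\varepsilon/4))$. We can partition $\mathbb{C}$ into $O((L/\varepsilon)^{d/\alpha})$
cells of diameter $O((\varepsilon/L)^{1/\alpha})$, and set a constant density value within each cell, on an $O(\varepsilon)$-grid
of density values, and every prior with $(L, \alpha)$-Hölder smooth density will have density within
$\varepsilon$ of some density so-constructed; there are then at most $(1/\varepsilon)^{O((L/\varepsilon)^{d/\alpha})}$ such densities, so this
bounds the covering numbers of $\Pi_\Theta$. Furthermore, the covering number of $\Pi_\Theta$ upper bounds
$N(\varepsilon)$ [Yang, Hanneke, and Carbonell, 2011], so that $N(\varepsilon) \le (1/\varepsilon)^{O((L/\varepsilon)^{d/\alpha})}$.

Solving $T = O(\varepsilon^{-2}(L/\varepsilon)^{d/\alpha} \log(1/\varepsilon))$ for $\varepsilon$, we have $\varepsilon = O\left(L^{\frac{d}{d+2\alpha}} \left(\frac{\log(TL)}{T}\right)^{\frac{\alpha}{d+2\alpha}}\right)$. So this
bounds the rate of convergence for $\mathbb{E}\|\mathbb{P}_{\mathbb{Z}_d(\hat{\theta}_T)} - \mathbb{P}_{\mathbb{Z}_d(\theta_\star)}\|$, for $\hat{\theta}_T$ the minimum distance skeleton
estimate. Plugging this rate into the bound on the priors, combined with Jensen’s inequality, we
have

$$\mathbb{E}\|\pi_{\hat{\theta}_T} - \pi_{\theta_\star}\| \le (L + 1)\gamma^{\alpha} + 4\left(2ec\frac{d}{\gamma} \log\left(\frac{1}{\gamma}\right)\right)^{2d+2} \times O\left(L^{\frac{d}{2d+4\alpha}} \left(\frac{\log(TL)}{T}\right)^{\frac{\alpha}{2d+4\alpha}}\right).$$

This holds for any $\gamma > 0$, so minimizing this expression over $\gamma > 0$ yields a bound on the rate.
For instance, with $\gamma = \tilde{O}\left(T^{-\frac{\alpha}{2(d+2\alpha)(\alpha+2(d+1))}}\right)$, we have

$$\mathbb{E}\|\pi_{\hat{\theta}_T} - \pi_{\theta_\star}\| = \tilde{O}\left(L T^{-\frac{\alpha^2}{2(d+2\alpha)(\alpha+2(d+1))}}\right).$$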


□ 
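The choice of $\gamma$ at the end of the proof can be checked by balancing exponents. The following worked calculation is a sketch that ignores constants, logarithmic factors, and the dependence on $L$, and assumes the final bound has the two-term form $\gamma^{\alpha} + \gamma^{-(2d+2)} T^{-\alpha/(2(d+2\alpha))}$; the symbol $a$ for the exponent of $\gamma = T^{-a}$ is introduced here only for illustration.

```latex
\begin{align*}
\gamma = T^{-a}
  &\;\Longrightarrow\;
  \gamma^{\alpha} = T^{-a\alpha},
  \qquad
  \gamma^{-(2d+2)}\, T^{-\frac{\alpha}{2(d+2\alpha)}}
    = T^{\,a(2d+2) - \frac{\alpha}{2(d+2\alpha)}}, \\
\text{balance: } a\alpha = \frac{\alpha}{2(d+2\alpha)} - a(2d+2)
  &\;\Longrightarrow\;
  a\,\bigl(\alpha + 2(d+1)\bigr) = \frac{\alpha}{2(d+2\alpha)}
  \;\Longrightarrow\;
  a = \frac{\alpha}{2(d+2\alpha)\bigl(\alpha + 2(d+1)\bigr)}, \\
\text{resulting rate: } T^{-a\alpha}
  &= T^{-\frac{\alpha^{2}}{2(d+2\alpha)(\alpha + 2(d+1))}} .
\end{align*}
```

For example, $d = 1$ and $\alpha = 1$ give $a = 1/30$ and a rate of order $T^{-1/30}$, which matches the exponent in Theorem 8.1.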


8.4  A  Minimax  Lower  Bound 

One natural question is whether Theorem 8.1 can generally be improved. While we expect this to
be true for some fixed VC classes (e.g., those of finite size), and in any case we expect that some
of the constant factors in the exponent may be improvable, it is not at this time clear whether
the general form of $T^{-\Theta(\alpha^2/(d+\alpha)^2)}$ is sometimes optimal. One way to investigate this question is
to construct specific spaces $\mathbb{C}$ and distributions $\mathcal{D}$ for which a lower bound can be obtained. In
particular, we are generally interested in exhibiting lower bounds that are worse than those that
apply to the usual problem of density estimation based on direct access to the $h^*_{t\theta_\star}$ values (see
Theorem 8.3 below).

Here we present a lower bound that is interesting for this reason. However, although larger
than the optimal rate for methods with direct access to the target concepts, it is still far from
matching the upper bound above, so that the question of tightness remains open. Specifically, we
have the following result.

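Theorem 8.2. For any integer $d \ge 1$, any $L > 0$, $\alpha \in (0, 1]$, there is a value $C(d, L, \alpha) \in (0, \infty)$
such that, for any $T \in \mathbb{N}$, there exists an instance space $\mathcal{X}$, a concept space $\mathbb{C}$ of VC dimension
$d$, a distribution $\mathcal{D}$ over $\mathcal{X}$, and a distribution $\pi_0$ over $\mathbb{C}$ such that, for $\Pi_\Theta$ a set of distributions
over $\mathbb{C}$ with $(L, \alpha)$-Hölder smooth density functions with respect to $\pi_0$, any estimator $\hat{\theta}_T =
\hat{\theta}_T(\mathbb{Z}_{1d}(\theta_\star), \ldots, \mathbb{Z}_{Td}(\theta_\star))$ ($T = 1, 2, \ldots$), has

$$\sup_{\theta_\star \in \Theta} \mathbb{E}\left[\|\pi_{\hat{\theta}_T} - \pi_{\theta_\star}\|\right] \ge C(d, L, \alpha) T^{-\frac{\alpha}{2(d+\alpha)}}.$$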
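To see how far apart the two bounds are, the exponents of $T$ in Theorem 8.1 and Theorem 8.2 can be compared numerically. The sketch below uses exact rational arithmetic; the function names are illustrative, and the comparison ignores constants and logarithmic factors.

```python
from fractions import Fraction


def upper_rate_exponent(d: int, alpha) -> Fraction:
    """Exponent r in the upper bound T^(-r) of Theorem 8.1."""
    alpha = Fraction(alpha)
    return alpha**2 / (2 * (d + 2 * alpha) * (alpha + 2 * (d + 1)))


def lower_rate_exponent(d: int, alpha) -> Fraction:
    """Exponent r in the lower bound T^(-r) of Theorem 8.2."""
    alpha = Fraction(alpha)
    return alpha / (2 * (d + alpha))
```

For $d = 1$ and $\alpha = 1$ these give $1/30$ and $1/4$ respectively, so the guaranteed rate $T^{-1/30}$ decays much more slowly than the lower bound $T^{-1/4}$, which is the gap the text describes as leaving tightness open.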

Proof (Sketch). We proceed by a reduction from the task of determining the bias of a coin from
among two given possibilities. Specifically, fix any $\gamma \in (0, 1/2)$, $n \in \mathbb{N}$, and let $B_1(p), \ldots, B_n(p)$
be i.i.d. Bernoulli($p$) random variables, for each $p \in [0, 1]$; then it is known that, for any (possibly
nondeterministic) decision rule $\hat{p}_n : \{0, 1\}^n \to \{(1 + \gamma)/2, (1 - \gamma)/2\}$,

$$\frac{1}{2} \sum_{p \in \{(1+\gamma)/2, (1-\gamma)/2\}} \mathbb{P}\left(\hat{p}_n(B_1(p), \ldots, B_n(p)) \ne p\right) \ge (1/32) \cdot \exp\left\{-128\gamma^2 n/3\right\}. \tag{8.1}$$

This  easily  follows  from  the  results  of  Bar-Yossef  [2003],  Wald  [1945],  combined  with  a  result 
of  Poland  and  Hutter  [2006]  bounding  the  KL  divergence. 
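As a sanity check on the scale of (8.1), one can compare its right-hand side with the exact error of a concrete decision rule. The sketch below assumes the left side of (8.1) is the error averaged over the two candidate biases, and uses the majority-vote rule with odd `n` as the example rule; both function names are illustrative.

```python
from math import comb, exp


def majority_error(gamma: float, n: int) -> float:
    """Exact average error of majority vote between biases (1 +/- gamma)/2, for odd n.

    By symmetry the error is the same under either bias, so it equals
    P(Binomial(n, (1 + gamma)/2) <= n // 2).
    """
    p = (1 + gamma) / 2
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1))


def coin_lower_bound(gamma: float, n: int) -> float:
    """Right-hand side of (8.1)."""
    return exp(-128 * gamma**2 * n / 3) / 32
```

The bound is far from tight for any particular rule; what the reduction below uses is only that it decays no faster than exponentially in $\gamma^2 n$.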

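To use this result, we construct a learning problem as follows. Fix some $m \in \mathbb{N}$ with $m \ge d$,
let $\mathcal{X} = \{1, \ldots, m\}$, and let $\mathbb{C}$ be the space of all classifiers $h : \mathcal{X} \to \{-1, +1\}$ such that
$|\{x \in \mathcal{X} : h(x) = +1\}| \le d$. Clearly the VC dimension of $\mathbb{C}$ is $d$. Define the distribution $\mathcal{D}$
as uniform over $\mathcal{X}$. Finally, we specify a family of $(L, \alpha)$-Hölder smooth priors, parameterized
by $\Theta = \{-1, +1\}^{\binom{m}{d}}$, as follows. Let $\gamma_m = (L/2)(1/m)^{\alpha}$. First, enumerate the $\binom{m}{d}$ distinct
$d$-sized subsets of $\{1, \ldots, m\}$ as $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_{\binom{m}{d}}$. Define the reference distribution $\pi_0$ by the
property that, for any $h \in \mathbb{C}$, letting $q = |\{x : h(x) = +1\}|$, $\pi_0(\{h\}) = \left(\frac{1}{2}\right)^d \binom{m-q}{m-d} / \binom{m}{d}$.
For any $\mathbf{b} = (b_1, \ldots, b_{\binom{m}{d}}) \in \{-1, 1\}^{\binom{m}{d}}$, define the prior $\pi_{\mathbf{b}}$ as the distribution of a random
variable $h_{\mathbf{b}}$ specified by the following generative model. Let $i^* \sim \mathrm{Uniform}(\{1, \ldots, \binom{m}{d}\})$, let
$C_{\mathbf{b}}(i^*) \sim \mathrm{Bernoulli}((1 + \gamma_m b_{i^*})/2)$; finally, $h_{\mathbf{b}} \sim \mathrm{Uniform}(\{h \in \mathbb{C} : \{x : h(x) = +1\} \subseteq
\mathcal{X}_{i^*}, \mathrm{Parity}(|\{x : h(x) = +1\}|) = C_{\mathbf{b}}(i^*)\})$, where $\mathrm{Parity}(n)$ is 1 if $n$ is odd, or 0 if $n$ is even.
We will refer to the variables in this generative model below. For any $h \in \mathbb{C}$, letting $H =
\{x : h(x) = +1\}$ and $q = |H|$, we can equivalently express

$$\pi_{\mathbf{b}}(\{h\}) = \left(\frac{1}{2}\right)^d \binom{m}{d}^{-1} \sum_{i=1}^{\binom{m}{d}} \mathbb{1}[H \subseteq \mathcal{X}_i] (1 + \gamma_m b_i)^{\mathrm{Parity}(q)} (1 - \gamma_m b_i)^{1 - \mathrm{Parity}(q)}.$$

From this explicit representation, it is clear that, letting $f_{\mathbf{b}} = \frac{d\pi_{\mathbf{b}}}{d\pi_0}$, we have
$f_{\mathbf{b}}(h) \in [1 - \gamma_m, 1 + \gamma_m]$ for all $h \in \mathbb{C}$. The fact that $f_{\mathbf{b}}$ is Hölder
smooth follows from this, since every distinct $h, g \in \mathbb{C}$ have $\mathcal{D}(\{x : h(x) \ne g(x)\}) \ge 1/m =
(2\gamma_m/L)^{1/\alpha}$.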
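The construction can be verified exhaustively for small $m$ and $d$. The sketch below is an illustration under the reading of the formulas given above: it builds $\pi_{\mathbf{b}}$ from the explicit sum over the $d$-sized subsets, obtains $\pi_0$ as the special case $\gamma_m = 0$, and checks the Hölder condition directly. Classifiers are indexed by their positive sets, and the names `build_prior` and `is_holder_smooth` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb
from typing import Dict, FrozenSet, Sequence


def build_prior(m: int, d: int, b: Sequence[int], gamma: Fraction) -> Dict[FrozenSet[int], Fraction]:
    """Mass assigned to each classifier, indexed by its positive set H with |H| <= d."""
    subsets = [frozenset(s) for s in combinations(range(1, m + 1), d)]
    scale = Fraction(1, 2**d * comb(m, d))
    prior = {}
    for q in range(d + 1):
        # Parity(q) = 1 selects the factor (1 + gamma * b_i), otherwise (1 - gamma * b_i).
        sign = 1 if q % 2 == 1 else -1
        for H in map(frozenset, combinations(range(1, m + 1), q)):
            prior[H] = scale * sum(
                1 + sign * gamma * b_i for X_i, b_i in zip(subsets, b) if H <= X_i
            )
    return prior


def is_holder_smooth(prior, reference, m: int, L, alpha) -> bool:
    """Check |f(h) - f(g)| <= L * rho(h, g)^alpha for the density f = prior / reference."""
    f = {H: prior[H] / reference[H] for H in prior}
    # Under the uniform distribution on {1..m}, rho(h, g) = |H symmetric-difference G| / m.
    return all(
        abs(f[H] - f[G]) <= L * Fraction(len(H ^ G), m) ** alpha
        for H in f for G in f
    )
```

For instance, with $m = 4$, $d = 2$, $L = 1$, $\alpha = 1$ (so $\gamma_m = 1/8$), both measures sum to one over the 11 classifiers and every density value lies in $[7/8, 9/8]$.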


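Next we set up the reduction as follows. For any estimator $\hat{\pi}_T = \hat{\pi}_T(\mathbb{Z}_{1d}(\theta_\star), \ldots, \mathbb{Z}_{Td}(\theta_\star))$,
and each $i \in \{1, \ldots, \binom{m}{d}\}$, let $h_i$ be the classifier with $\{x : h_i(x) = +1\} = \mathcal{X}_i$; also, if
$\hat{\pi}_T(\{h_i\}) > \left(\frac{1}{2}\right)^d / \binom{m}{d}$, let $\hat{b}_i = 2\mathrm{Parity}(d) - 1$, and otherwise $\hat{b}_i = 1 - 2\mathrm{Parity}(d)$. We
use these $\hat{b}_i$ values to estimate the original $b_i$ values. Specifically, let $\hat{p}_i = (1 + \gamma_m \hat{b}_i)/2$ and
$p_i = (1 + \gamma_m b_i)/2$, where $\mathbf{b} = \theta_\star$. Then

$$\|\hat{\pi}_T - \pi_{\theta_\star}\| \ge (1/2) \sum_{i=1}^{\binom{m}{d}} \left|\hat{\pi}_T(\{h_i\}) - \pi_{\theta_\star}(\{h_i\})\right|$$
$$\ge (1/2) \sum_{i=1}^{\binom{m}{d}} \frac{\gamma_m}{2^d \binom{m}{d}} |\hat{b}_i - b_i|/2 = (1/2) \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat{p}_i - p_i|.$$

Thus, we have reduced from the problem of deciding the biases of these $\binom{m}{d}$ independent
Bernoulli random variables. To complete the proof, it suffices to lower bound the expectation
of the right side for an arbitrary estimator.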

Toward this end, we in fact study an even easier problem. Specifically, consider an estimator
$\hat{q}_i = \hat{q}_i(\mathbb{Z}_{1d}(\theta_\star), \ldots, \mathbb{Z}_{Td}(\theta_\star), i^*_1, \ldots, i^*_T)$, where $i^*_t$ is the $i^*$ random variable in the generative
model that defines $h^*_{t\theta_\star}$; that is, $i^*_t \sim \mathrm{Uniform}(\{1, \ldots, \binom{m}{d}\})$, $C_t \sim \mathrm{Bernoulli}((1 + \gamma_m b_{i^*_t})/2)$,
and $h^*_{t\theta_\star} \sim \mathrm{Uniform}(\{h \in \mathbb{C} : \{x : h(x) = +1\} \subseteq \mathcal{X}_{i^*_t}, \mathrm{Parity}(|\{x : h(x) = +1\}|) =
C_t\})$, where the $i^*_t$ are independent across $t$, as are the $C_t$ and $h^*_{t\theta_\star}$. Clearly the $\hat{p}_i$ from above
can be viewed as an estimator of this type, which simply ignores the knowledge of $i^*_t$. The
knowledge of these $i^*_t$ variables simplifies the analysis, since given $\{i^*_t : t \le T\}$, the data
can be partitioned into $\binom{m}{d}$ disjoint sets, $\{\{\mathbb{Z}_{td}(\theta_\star) : i^*_t = i\} : i = 1, \ldots, \binom{m}{d}\}$, and we can
use only the set $\{\mathbb{Z}_{td}(\theta_\star) : i^*_t = i\}$ to estimate $p_i$. Furthermore, we can use only the subset
of these for which $\mathbb{X}_{td} = \mathcal{X}_i$, since otherwise we have zero information about the value of
$\mathrm{Parity}(|\{x : h^*_{t\theta_\star}(x) = +1\}|)$. That is, given $i^*_t = i$, any $\mathbb{Z}_{td}(\theta_\star)$ is conditionally independent
from every $b_j$ for $j \ne i$, and is even conditionally independent from $b_i$ when $\mathbb{X}_{td}$ is not completely
contained in $\mathcal{X}_i$; specifically, in this case, regardless of $b_i$, the conditional distribution of $\mathbb{Y}_{td}(\theta_\star)$
given $i^*_t = i$ and given $\mathbb{X}_{td}$ is a product distribution, which deterministically assigns label $-1$ to
those $Y_{tk}(\theta_\star)$ with $X_{tk} \notin \mathcal{X}_i$, and gives uniform random values to the subset of $\mathbb{Y}_{td}(\theta_\star)$ with their
respective $X_{tk} \in \mathcal{X}_i$. Finally, letting $r_t = \mathrm{Parity}(|\{k \le d : Y_{tk}(\theta_\star) = +1\}|)$, we note that given
$i^*_t = i$, $\mathbb{X}_{td} = \mathcal{X}_i$, and the value $r_t$, $b_i$ is conditionally independent from $\mathbb{Z}_{td}(\theta_\star)$. Thus, the set of
values $C_{iT}(\theta_\star) = \{r_t : i^*_t = i, \mathbb{X}_{td} = \mathcal{X}_i\}$ is a sufficient statistic for $b_i$ (hence for $p_i$). Recall that,
when $i^*_t = i$ and $\mathbb{X}_{td} = \mathcal{X}_i$, the value of $r_t$ is equal to $C_t$, a Bernoulli($p_i$) random variable. Thus,
we neither lose nor gain anything (in terms of risk) by restricting ourselves to estimators $\hat{q}_i$ of
the type $\hat{q}_i = \hat{q}_i(\mathbb{Z}_{1d}(\theta_\star), \ldots, \mathbb{Z}_{Td}(\theta_\star), i^*_1, \ldots, i^*_T) = \hat{q}'_i(C_{iT}(\theta_\star))$, for some $\hat{q}'_i$ [Schervish, 1995]:
that is, estimators that are a function of the $N_{iT}(\theta_\star) = |C_{iT}(\theta_\star)|$ Bernoulli($p_i$) random variables,
which we should note are conditionally i.i.d. given $N_{iT}(\theta_\star)$.

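Thus, by (8.1), for any $n \leq T$,

$$\frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{E}\left[|\hat q_i - p_i| \,\middle|\, N_{iT}(\theta_\star) = n\right] = \gamma_m \cdot \frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{P}\left(\hat q_i \neq p_i \,\middle|\, N_{iT}(\theta_\star) = n\right) \geq (\gamma_m/32) \cdot \exp\left\{-128 \gamma_m^2 n / 3\right\}.$$

Also note that, for each $i$, $\mathbb{E}[N_{iT}(\theta_\star)] = \frac{d!\,(1/m)^d}{\binom{m}{d}} T \leq (d/m)^{2d} T = d^{2d} (2\gamma_m/L)^{2d/\alpha} T$, so that Jensen's
inequality, linearity of expectation, and the law of total expectation imply

$$\frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{E}\left[|\hat q_i - p_i|\right] \geq (\gamma_m/32) \cdot \exp\left\{-43 (2/L)^{2d/\alpha} d^{2d} \gamma_m^{2 + 2d/\alpha} T\right\}.$$

Thus, by linearity of the expectation,

$$\left(\frac{1}{2}\right)^{\binom{m}{d}} \sum_{b \in \{-1,+1\}^{\binom{m}{d}}} \mathbb{E}\left[\sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat q_i - p_i|\right] = \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} \cdot \frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{E}\left[|\hat q_i - p_i|\right]$$
$$\geq \left(\gamma_m / (32 \cdot 2^d)\right) \cdot \exp\left\{-43 (2/L)^{2d/\alpha} d^{2d} \gamma_m^{2 + 2d/\alpha} T\right\}.$$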


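In particular, taking

$$m = (L/2)^{1/\alpha} \left(43 (2/L)^{2d/\alpha} d^{2d} T\right)^{\frac{1}{2(d+\alpha)}},$$

we have

$$\gamma_m = \Theta\left(\left(43 (2/L)^{2d/\alpha} d^{2d} T\right)^{-\frac{\alpha}{2(d+\alpha)}}\right),$$

so that

$$\left(\frac{1}{2}\right)^{\binom{m}{d}} \sum_{b \in \{-1,+1\}^{\binom{m}{d}}} \mathbb{E}\left[\sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat q_i - p_i|\right] = \Omega\left(2^{-d} \left(43 (2/L)^{2d/\alpha} d^{2d} T\right)^{-\frac{\alpha}{2(d+\alpha)}}\right).$$

In particular, this implies there exists some $b$ for which

$$\mathbb{E}\left[\sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat q_i - p_i|\right] = \Omega\left(2^{-d} \left(43 (2/L)^{2d/\alpha} d^{2d} T\right)^{-\frac{\alpha}{2(d+\alpha)}}\right).$$

Applying this lower bound to the estimator $\hat p_i$ defined above yields the result.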


□ 


In  the  extreme  case  of  allowing  arbitrary  dependence  on  the  data  samples,  we  merely  recover 
the  known  results  lower  bounding  the  risk  of  density  estimation  from  i.i.d.  samples  from  a 
smooth  density,  as  indicated  by  the  following  result. 

Theorem 8.3. For any integer $d \geq 1$, there exists an instance space $\mathcal{X}$, a concept space $C$ of
VC dimension $d$, a distribution $\mathcal{D}$ over $\mathcal{X}$, and a distribution $\pi_0$ over $C$ such that, for $\Pi_\Theta$ the
set of distributions over $C$ with $(L, \alpha)$-Hölder smooth density functions with respect to $\pi_0$, any
sequence of estimators, $\hat\theta_T = \hat\theta_T(\mathcal{Z}_1(\theta_\star), \ldots, \mathcal{Z}_T(\theta_\star))$ $(T = 1, 2, \ldots)$, has

$$\sup_{\theta_\star \in \Theta} \mathbb{E}\left[\|\pi_{\hat\theta_T} - \pi_{\theta_\star}\|\right] = \Omega\left(T^{-\frac{\alpha}{d + 2\alpha}}\right).$$

The proof is a simple reduction from the problem of estimating $\pi_{\theta_\star}$ based on direct access to
$h^*_{1\theta_\star}, \ldots, h^*_{T\theta_\star}$, which is essentially equivalent to the standard model of density estimation, and
indeed the lower bound in Theorem 8.3 is a well-known result for density estimation from $T$ i.i.d.
samples from a Hölder smooth density in a $d$-dimensional space [see e.g., Devroye and Lugosi,
2001].

8.5  Future  Directions 

There are several interesting questions that remain open at this time. Can either the lower bound
or upper bound be improved in general? If, instead of $d$ samples per task, we use $m \geq d$
samples, how does the minimax risk vary with $m$? Related to this, what is the optimal value of
$m$ to optimize the rate of convergence as a function of $mT$, the total number of samples? More
generally, if an estimator is permitted to use $N$ total samples, taken from however many tasks it
wishes, what is the optimal rate of convergence as a function of $N$?

Chapter 9


Estimation  of  Priors  with  Applications  to 
Preference  Elicitation 


Abstract 

¹We extend the work of [Yang, Hanneke, and Carbonell, 2013] on estimating prior distributions
over  VC  classes  to  the  case  of  real-valued  functions  in  a  VC  subgraph  class.  We  then  apply  this 
technique  to  the  problem  of  maximizing  customer  satisfaction  using  a  minimal  number  of  value 
queries  in  an  online  preference  elicitation  scenario. 


9.1  Introduction 

Consider an online travel agency, where customers go to the site with some idea of what type of
travel they are interested in; the site then poses a series of questions to each customer, and
identifies a travel package that best suits their desires, budget, and dates. There are many options of
travel packages, with options on location, sightseeing tours, hotel and room quality, etc. Because
of this, serving the needs of an arbitrary customer might be a lengthy process, requiring many
detailed questions. Fortunately, the stream of customers is typically not a worst-case sequence,
and obeys many statistical regularities: in particular, it is not too far from reality
to think of the customers as being independent and identically distributed samples. With this
assumption in mind, it becomes desirable to identify some of these statistical regularities so that
we can pose the questions that are typically most relevant, and thereby more quickly identify
the travel package that best suits the needs of the typical customer. One straightforward way
to do this is to directly estimate the distribution of customer value functions, and optimize the
questioning system to minimize the expected number of questions needed to find a suitable travel
package.

¹This chapter is based on joint work with Steve Hanneke.

One  can  model  this  problem  in  the  style  of  Bayesian  combinatorial  auctions,  in  which  each 
customer has a value function for each possible bundle of items. However, it is slightly different,
in that we do not assume the distribution of customers is known, but rather are interested in
estimating this distribution; the obtained estimate can then be used in combination with methods
based on Bayesian decision theory. In contrast to the literature on Bayesian auctions (and subjectivist
Bayesian decision theory in general), this technique is able to maintain general guarantees
on performance that hold under an objective interpretation of the problem, rather than merely
guarantees holding under an arbitrary assumed prior belief. This general idea is sometimes referred
to as Empirical Bayesian decision theory in the machine learning and statistics literatures.
The  ideal  result  for  an  Empirical  Bayesian  algorithm  is  to  be  competitive  with  the  corresponding 
Bayesian  methods  based  on  the  actual  distribution  of  the  data  (assuming  the  data  are  random, 
with  an  unknown  distribution);  that  is,  although  the  Empirical  Bayesian  methods  only  operate 
with  a  data-based  estimate  of  the  distribution,  the  aim  is  to  perform  nearly  as  well  as  methods 
based  on  the  true  (unobservable)  distribution.  In  this  work,  we  present  results  of  this  type,  in  the 
context  of  an  abstraction  of  the  aforementioned  online  travel  agency  problem,  where  the  measure 
of  performance  is  the  expected  number  of  questions  to  find  a  suitable  package. 

The technique we use here is rooted in the work of [Yang, Hanneke, and Carbonell, 2013] on
transfer learning with a VC class. The component of that work of interest here is the estimation
of  prior  distributions  over  VC  classes.  Essentially,  there  is  a  given  class  of  functions,  from  which 
a  sequence  of  functions  is  sampled  i.i.d.  according  to  an  unknown  distribution.  We  observe 
a  number  of  values  of  each  of  these  functions,  evaluated  at  points  chosen  at  random,  and  are 
then  tasked  with  estimating  the  distribution  of  these  functions.  This  is  more  challenging  than 
the  traditional  problem  of  nonparametric  density  estimation,  since  we  are  not  permitted  direct 
access to these functions, but rather only a limited number of evaluations of the function (i.e.,
a number of $(x, f(x))$ pairs). The work of [Yang, Hanneke, and Carbonell, 2013] develops a
technique for estimating the distribution of these functions, given that the functions are
binary-valued, the class of functions has finite VC dimension, and the class of distributions is totally
bounded. In this work, we extend this technique to classes of real-valued functions having finite
pseudo-dimension, a natural generalization of VC dimension for real-valued functions [Haussler,
1992].

The  specific  application  we  are  interested  in  here  may  be  expressed  abstractly  as  a  kind  of 
combinatorial  auction  with  preference  elicitation.  Specifically,  we  suppose  there  is  a  collection 
of  items  on  a  menu,  and  each  possible  bundle  of  items  has  an  associated  fixed  price.  There  is 
a  stream  of  customers,  each  with  a  valuation  function  that  provides  a  value  for  each  possible 
bundle of items. The objective is to serve each customer a bundle of items that nearly maximizes
his or her surplus value (value minus price). However, we are not permitted direct observation
of the customer valuation functions; rather, we may query for the value of any given bundle of
items; this is referred to as a value query in the literature on preference elicitation in combinatorial
auctions (see Chapter 14 of [Cramton, Shoham, and Steinberg, 2006], [Zinkevich, Blum, and
Sandholm, 2003]). The objective is to achieve this near-maximal surplus guarantee, while making
only a small number of queries per customer. We suppose the customer valuation functions
are sampled i.i.d. according to an unknown distribution over a known (but arbitrary) class of
real-valued functions having finite pseudo-dimension. Reasoning that knowledge of this distribution
should allow one to make a smaller number of value queries per customer, we are interested in
estimating this unknown distribution, so that as we serve more and more customers, the number
of queries per customer required to identify a near-optimal bundle should decrease. In this
context, we in fact prove that in the limit, the expected number of queries per customer converges
to the number required of a method having direct knowledge of the true distribution of valuation
functions.


9.2  Notation 

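Let $\mathcal{B}$ denote a $\sigma$-algebra on $\mathcal{X} \times \mathbb{R}$, let $\mathcal{B}_{\mathcal{X}}$ denote the $\sigma$-algebra on $\mathcal{X}$. Also let $\rho(h, g) =
\int |h - g| \, dP_X$, where $P_X$ is a marginal distribution over $\mathcal{X}$. Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to \mathbb{R}$
with Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{F}}$ induced by $\rho$. Let $\Theta$ be a set of parameters, and for each $\theta \in \Theta$, let $\pi_\theta$
denote a probability measure on $(\mathcal{F}, \mathcal{B}_{\mathcal{F}})$. We suppose $\{\pi_\theta : \theta \in \Theta\}$ is totally bounded in total
variation distance, and that $\mathcal{F}$ is a uniformly bounded VC subgraph class with pseudodimension
$d$. We also suppose $\rho$ is a metric when restricted to $\mathcal{F}$.

Let $\{X_{ti}\}_{t,i \in \mathbb{N}}$ be i.i.d. $P_X$ random variables. For each $\theta \in \Theta$, let $\{h^*_{t\theta}\}_{t \in \mathbb{N}}$ be i.i.d. $\pi_\theta$ random
variables, independent from $\{X_{ti}\}_{t,i \in \mathbb{N}}$. For each $t \in \mathbb{N}$ and $\theta \in \Theta$, let $Y_{ti}(\theta) = h^*_{t\theta}(X_{ti})$ for
$i \in \mathbb{N}$, and let $\mathcal{Z}_t(\theta) = \{(X_{t1}, Y_{t1}(\theta)), (X_{t2}, Y_{t2}(\theta)), \ldots\}$, $\mathbb{X}_t = \{X_{t1}, X_{t2}, \ldots\}$, and $\mathbb{Y}_t(\theta) =
\{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$; for each $k \in \mathbb{N}$, define $\mathcal{Z}_{tk}(\theta) = \{(X_{t1}, Y_{t1}(\theta)), \ldots, (X_{tk}, Y_{tk}(\theta))\}$, $\mathbb{X}_{tk} =
\{X_{t1}, \ldots, X_{tk}\}$, and $\mathbb{Y}_{tk}(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$.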

For any probability measures $\mu$, $\mu'$, we denote the total variation distance by

$$\|\mu - \mu'\| = \sup_{A} \mu(A) - \mu'(A),$$

where $A$ ranges over measurable sets.
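As a small illustration of this definition, the following Python sketch computes the total variation distance between two distributions on a finite set. It is only a sketch under the assumption of a finite outcome space (the chapter works with general measurable spaces); the function names are hypothetical. On a finite space the supremum is attained at the set of outcomes where the first distribution puts more mass than the second, and the brute-force version checks this by enumerating every subset.

```python
from itertools import chain, combinations


def total_variation(mu, mu_prime):
    """sup_A (mu(A) - mu'(A)) for distributions given as outcome -> probability dicts."""
    support = set(mu) | set(mu_prime)
    # The supremum is attained at A = {x : mu(x) > mu'(x)}.
    return sum(max(mu.get(x, 0.0) - mu_prime.get(x, 0.0), 0.0) for x in support)


def total_variation_bruteforce(mu, mu_prime):
    """Same quantity, by enumerating every subset A (feasible only for tiny supports)."""
    support = sorted(set(mu) | set(mu_prime))
    subsets = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(
        sum(mu.get(x, 0.0) - mu_prime.get(x, 0.0) for x in subset)
        for subset in subsets
    )


if __name__ == "__main__":
    mu = {"a": 0.5, "b": 0.3, "c": 0.2}
    mu_prime = {"a": 0.2, "b": 0.3, "c": 0.5}
    print(total_variation(mu, mu_prime))
    print(total_variation_bruteforce(mu, mu_prime))
```

Because both measures have total mass one, the one-sided supremum used here coincides with the supremum of the absolute difference.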

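Lemma 9.1. For any $\theta, \theta' \in \Theta$ and $t \in \mathbb{N}$,

$$\|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{\mathcal{Z}_t(\theta)} - \mathbb{P}_{\mathcal{Z}_t(\theta')}\|.$$

Proof. Fix $\theta, \theta' \in \Theta$, $t \in \mathbb{N}$. Let $\mathbb{X} = \{X_{t1}, X_{t2}, \ldots\}$, $\mathbb{Y}(\theta) = \{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$, and for
$k \in \mathbb{N}$ let $\mathbb{X}_k = \{X_{t1}, \ldots, X_{tk}\}$ and $\mathbb{Y}_k(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$. For $h \in \mathcal{F}$, let $c_{\mathbb{X}}(h) =
\{(X_{t1}, h(X_{t1})), (X_{t2}, h(X_{t2})), \ldots\}$.

For $h, g \in \mathcal{F}$, define $\rho_{\mathbb{X}}(h, g) = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} |h(X_{ti}) - g(X_{ti})|$ (if the limit exists), and
$\rho_{\mathbb{X}_k}(h, g) = \frac{1}{k} \sum_{i=1}^{k} |h(X_{ti}) - g(X_{ti})|$. Note that since $\mathcal{F}$ is a uniformly bounded VC subgraph
class, so is the collection of functions $\{|h - g| : h, g \in \mathcal{F}\}$, so that the uniform strong law of
large numbers implies that with probability one, $\forall h, g \in \mathcal{F}$, $\rho_{\mathbb{X}}(h, g)$ exists and has $\rho_{\mathbb{X}}(h, g) =
\rho(h, g)$ [Vapnik, 1982].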
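To make the role of the empirical quantity $\rho_{\mathbb{X}_k}$ concrete, here is a minimal Python sketch that compares the sample average of $|h(X_{ti}) - g(X_{ti})|$ with the integral $\rho(h, g)$ for a single pair of functions. Everything specific in it is an assumption chosen for illustration: $P_X$ is taken to be Uniform[0, 1], and $h(x) = x$, $g(x) = x^2$ are hypothetical bounded functions. It only illustrates pointwise convergence for one pair, not the uniform law over the whole class used in the proof.

```python
import random


def empirical_rho(h, g, xs):
    """Sample average of |h(x) - g(x)| over xs (finite-sample analogue of rho)."""
    return sum(abs(h(x) - g(x)) for x in xs) / len(xs)


def demo(seed=0, k=20_000):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(k)]  # assumption: P_X is Uniform[0, 1]

    def h(x):  # hypothetical bounded function on [0, 1]
        return x

    def g(x):  # hypothetical bounded function on [0, 1]
        return x * x

    # Exact value: integral of (x - x^2) over [0, 1] = 1/2 - 1/3 = 1/6.
    return empirical_rho(h, g, xs), 1.0 / 6.0


if __name__ == "__main__":
    estimate, exact = demo()
    print(estimate, exact)
```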

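Consider any $\theta, \theta' \in \Theta$, and any $A \in \mathcal{B}_{\mathcal{F}}$. Then any $h \notin A$ has $\forall g \in A$, $\rho(h, g) > 0$ (by the
metric assumption). Thus, if $\rho_{\mathbb{X}}(h, g) = \rho(h, g)$ for all $h, g \in \mathcal{F}$, then $\forall h \notin A$,

$$\forall g \in A, \ \rho_{\mathbb{X}}(h, g) = \rho(h, g) > 0 \implies \forall g \in A, \ c_{\mathbb{X}}(h) \neq c_{\mathbb{X}}(g) \implies c_{\mathbb{X}}(h) \notin c_{\mathbb{X}}(A).$$

This implies $c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A)) = A$. Under these conditions,

$$\mathbb{P}_{\mathcal{Z}_t(\theta) | \mathbb{X}}(c_{\mathbb{X}}(A)) = \pi_\theta(c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A))) = \pi_\theta(A),$$

and similarly for $\theta'$.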

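Any measurable set $C$ for the range of $\mathcal{Z}_t(\theta)$ can be expressed as $C = \{c_{\bar{x}}(h) : (h, \bar{x}) \in C'\}$
for some appropriate $C' \in \mathcal{B}_{\mathcal{F}} \otimes \mathcal{B}_{\mathcal{X}}^{\infty}$. Letting $C'_{\bar{x}} = \{h : (h, \bar{x}) \in C'\}$, we have

$$\mathbb{P}_{\mathcal{Z}_t(\theta)}(C) = \int \pi_\theta(c_{\bar{x}}^{-1}(c_{\bar{x}}(C'_{\bar{x}}))) \, \mathbb{P}_{\mathbb{X}}(d\bar{x}) = \int \pi_\theta(C'_{\bar{x}}) \, \mathbb{P}_{\mathbb{X}}(d\bar{x}) = \mathbb{P}_{(h^*_{t\theta}, \mathbb{X})}(C').$$

Likewise, this reasoning holds for $\theta'$. Then

$$\|\mathbb{P}_{\mathcal{Z}_t(\theta)} - \mathbb{P}_{\mathcal{Z}_t(\theta')}\| = \|\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})} - \mathbb{P}_{(h^*_{t\theta'}, \mathbb{X})}\|$$
$$= \sup_{C' \in \mathcal{B}_{\mathcal{F}} \otimes \mathcal{B}_{\mathcal{X}}^{\infty}} \left| \int \left(\pi_\theta(C'_{\bar{x}}) - \pi_{\theta'}(C'_{\bar{x}})\right) \mathbb{P}_{\mathbb{X}}(d\bar{x}) \right|$$
$$\leq \int \sup_{A \in \mathcal{B}_{\mathcal{F}}} |\pi_\theta(A) - \pi_{\theta'}(A)| \, \mathbb{P}_{\mathbb{X}}(d\bar{x}) = \|\pi_\theta - \pi_{\theta'}\|.$$

Since $h^*_{t\theta}$ and $\mathbb{X}$ are independent, for $A \in \mathcal{B}_{\mathcal{F}}$, $\pi_\theta(A) = \mathbb{P}_{h^*_{t\theta}}(A) = \mathbb{P}_{h^*_{t\theta}}(A) \mathbb{P}_{\mathbb{X}}(\mathcal{X}^{\infty}) =
\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})}(A \times \mathcal{X}^{\infty})$. Analogous reasoning holds for $h^*_{t\theta'}$. Thus, we have

$$\|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})}(\cdot \times \mathcal{X}^{\infty}) - \mathbb{P}_{(h^*_{t\theta'}, \mathbb{X})}(\cdot \times \mathcal{X}^{\infty})\|$$
$$\leq \|\mathbb{P}_{(h^*_{t\theta}, \mathbb{X})} - \mathbb{P}_{(h^*_{t\theta'}, \mathbb{X})}\| = \|\mathbb{P}_{\mathcal{Z}_t(\theta)} - \mathbb{P}_{\mathcal{Z}_t(\theta')}\|.$$

Combining the above, we have $\|\mathbb{P}_{\mathcal{Z}_t(\theta)} - \mathbb{P}_{\mathcal{Z}_t(\theta')}\| = \|\pi_\theta - \pi_{\theta'}\|$. □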

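Lemma 9.2. There exists a sequence $r_k = o(1)$ such that, $\forall t, k \in \mathbb{N}$, $\forall \theta, \theta' \in \Theta$,

$$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| \leq \|\pi_\theta - \pi_{\theta'}\| \leq \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| + r_k.$$

Proof. This proof follows identically to a proof of [Yang, Hanneke, and Carbonell, 2013], but is
included here for completeness. Since $\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(A) = \mathbb{P}_{\mathcal{Z}_t(\theta)}(A \times (\mathcal{X} \times \mathbb{R})^{\infty})$ for all measurable
$A \subseteq (\mathcal{X} \times \mathbb{R})^k$, and similarly for $\theta'$, we have

$$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| = \sup_{A \in \mathcal{B}^k} \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(A) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(A)$$
$$= \sup_{A \in \mathcal{B}^k} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A \times (\mathcal{X} \times \mathbb{R})^{\infty}) - \mathbb{P}_{\mathcal{Z}_t(\theta')}(A \times (\mathcal{X} \times \mathbb{R})^{\infty})$$
$$\leq \sup_{A \in \mathcal{B}^{\infty}} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A) - \mathbb{P}_{\mathcal{Z}_t(\theta')}(A) = \|\mathbb{P}_{\mathcal{Z}_t(\theta)} - \mathbb{P}_{\mathcal{Z}_t(\theta')}\|,$$

which implies the left inequality when combined with Lemma 9.1.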
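The left inequality says that observing only the first $k$ coordinates cannot make two distributions easier to tell apart than observing the full sequence. The following Python sketch checks this numerically on a toy example. It is an illustration under simplifying assumptions, not part of the argument: outcomes are finite tuples rather than infinite sequences of $(x, y)$ pairs, and the two distributions `p` and `q` are made up for the example.

```python
from collections import defaultdict


def tv(p, q):
    """Total variation distance between two finite distributions (dicts)."""
    keys = set(p) | set(q)
    return sum(max(p.get(z, 0.0) - q.get(z, 0.0), 0.0) for z in keys)


def marginal(p, k):
    """Distribution of the first k coordinates of a tuple-valued outcome."""
    out = defaultdict(float)
    for z, prob in p.items():
        out[z[:k]] += prob
    return dict(out)


if __name__ == "__main__":
    # Hypothetical distributions over {0,1}^3.
    p = {(0, 0, 0): 0.4, (0, 1, 1): 0.1, (1, 0, 1): 0.3, (1, 1, 0): 0.2}
    q = {(0, 0, 0): 0.1, (0, 1, 1): 0.4, (1, 0, 1): 0.2, (1, 1, 0): 0.3}
    for k in range(4):
        print(k, tv(marginal(p, k), marginal(q, k)))
```

In this example the distance between the marginals is nondecreasing in $k$ and never exceeds the distance between the full distributions, which is the monotonicity the proof relies on when it later passes to the limit in $k$.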

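Next, we focus on the right inequality. Fix $\theta, \theta' \in \Theta$ and $\gamma > 0$, and let $B \in \mathcal{B}^{\infty}$ be such that

$$\|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{\mathcal{Z}_t(\theta)} - \mathbb{P}_{\mathcal{Z}_t(\theta')}\| < \mathbb{P}_{\mathcal{Z}_t(\theta)}(B) - \mathbb{P}_{\mathcal{Z}_t(\theta')}(B) + \gamma.$$

Let $\mathcal{A} = \{A \times (\mathcal{X} \times \mathbb{R})^{\infty} : A \in \mathcal{B}^k, k \in \mathbb{N}\}$. Note that $\mathcal{A}$ is an algebra that generates $\mathcal{B}^{\infty}$.
Thus, Carathéodory's extension theorem [Schervish, 1995] implies that there exist disjoint sets
$\{A_i\}_{i \in \mathbb{N}}$ in $\mathcal{A}$ such that $B \subseteq \bigcup_{i \in \mathbb{N}} A_i$ and

$$\mathbb{P}_{\mathcal{Z}_t(\theta)}(B) - \mathbb{P}_{\mathcal{Z}_t(\theta')}(B) < \sum_{i \in \mathbb{N}} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A_i) - \sum_{i \in \mathbb{N}} \mathbb{P}_{\mathcal{Z}_t(\theta')}(A_i) + \gamma.$$

Since these $A_i$ sets are disjoint, each of these sums is bounded by a probability value, which
implies that there exists some $n \in \mathbb{N}$ such that

$$\sum_{i \in \mathbb{N}} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A_i) < \gamma + \sum_{i=1}^{n} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A_i),$$

which implies

$$\sum_{i \in \mathbb{N}} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A_i) - \sum_{i \in \mathbb{N}} \mathbb{P}_{\mathcal{Z}_t(\theta')}(A_i) < \gamma + \sum_{i=1}^{n} \mathbb{P}_{\mathcal{Z}_t(\theta)}(A_i) - \sum_{i=1}^{n} \mathbb{P}_{\mathcal{Z}_t(\theta')}(A_i)$$
$$= \gamma + \mathbb{P}_{\mathcal{Z}_t(\theta)}\left(\bigcup_{i=1}^{n} A_i\right) - \mathbb{P}_{\mathcal{Z}_t(\theta')}\left(\bigcup_{i=1}^{n} A_i\right).$$

As $\bigcup_{i=1}^{n} A_i \in \mathcal{A}$, there exists $m \in \mathbb{N}$ and measurable $B_m \in \mathcal{B}^m$ such that $\bigcup_{i=1}^{n} A_i = B_m \times (\mathcal{X} \times
\mathbb{R})^{\infty}$, and therefore

$$\mathbb{P}_{\mathcal{Z}_t(\theta)}\left(\bigcup_{i=1}^{n} A_i\right) - \mathbb{P}_{\mathcal{Z}_t(\theta')}\left(\bigcup_{i=1}^{n} A_i\right) = \mathbb{P}_{\mathcal{Z}_{tm}(\theta)}(B_m) - \mathbb{P}_{\mathcal{Z}_{tm}(\theta')}(B_m)$$
$$\leq \|\mathbb{P}_{\mathcal{Z}_{tm}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tm}(\theta')}\| \leq \lim_{k \to \infty} \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\|.$$

Combining the above, we have $\|\pi_\theta - \pi_{\theta'}\| \leq \lim_{k \to \infty} \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| + 3\gamma$. By letting $\gamma$
approach 0, we have

$$\|\pi_\theta - \pi_{\theta'}\| \leq \lim_{k \to \infty} \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\|.$$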

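So there exists a sequence $r_k(\theta, \theta') = o(1)$ such that

$$\forall k \in \mathbb{N}, \ \|\pi_\theta - \pi_{\theta'}\| \leq \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| + r_k(\theta, \theta').$$

Now let $\gamma > 0$ and let $\Theta_\gamma$ be a minimal $\gamma$-cover of $\Theta$. Define the quantity $r_k(\gamma) = \max_{\theta, \theta' \in \Theta_\gamma} r_k(\theta, \theta')$.
Then for any $\theta, \theta' \in \Theta$, let $\theta_\gamma = \mathrm{argmin}_{\theta'' \in \Theta_\gamma} \|\pi_\theta - \pi_{\theta''}\|$ and $\theta'_\gamma = \mathrm{argmin}_{\theta'' \in \Theta_\gamma} \|\pi_{\theta'} - \pi_{\theta''}\|$.
Then a triangle inequality implies that $\forall k \in \mathbb{N}$,

$$\|\pi_\theta - \pi_{\theta'}\| \leq \|\pi_\theta - \pi_{\theta_\gamma}\| + \|\pi_{\theta_\gamma} - \pi_{\theta'_\gamma}\| + \|\pi_{\theta'_\gamma} - \pi_{\theta'}\|$$
$$< 2\gamma + r_k(\theta_\gamma, \theta'_\gamma) + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}\|$$
$$\leq 2\gamma + r_k(\gamma) + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}\|.$$

Triangle inequalities and the left inequality from the lemma statement (already established) imply

$$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}\| \leq \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}\| + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\|$$
$$\leq \|\pi_{\theta_\gamma} - \pi_\theta\| + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| + \|\pi_{\theta'_\gamma} - \pi_{\theta'}\|$$
$$< 2\gamma + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\|.$$

So in total we have

$$\|\pi_\theta - \pi_{\theta'}\| \leq 4\gamma + r_k(\gamma) + \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\|.$$

Since this holds for all $\gamma > 0$, defining $r_k = \inf_{\gamma > 0} (4\gamma + r_k(\gamma))$, we have the right inequality
of the lemma statement. Furthermore, since each $r_k(\theta, \theta') = o(1)$, and $|\Theta_\gamma| < \infty$, we have
$r_k(\gamma) = o(1)$ for each $\gamma > 0$, and thus we also have $r_k = o(1)$. □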

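Lemma 9.3. $\forall t, k \in \mathbb{N}$, there exists a monotone function $M_k(x) = o(1)$ such that, $\forall \theta, \theta' \in \Theta$,

$$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| \leq M_k\left(\|\mathbb{P}_{\mathcal{Z}_{td}(\theta)} - \mathbb{P}_{\mathcal{Z}_{td}(\theta')}\|\right).$$

Proof. Fix any $t \in \mathbb{N}$, and let $\mathbb{X} = \{X_{t1}, X_{t2}, \ldots\}$ and $\mathbb{Y}(\theta) = \{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$, and for
$k \in \mathbb{N}$ let $\mathbb{X}_k = \{X_{t1}, \ldots, X_{tk}\}$ and $\mathbb{Y}_k(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$.

If $k \leq d$, then $\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(\cdot) = \mathbb{P}_{\mathcal{Z}_{td}(\theta)}(\cdot \times (\mathcal{X} \times \mathbb{R})^{d-k})$, so that

$$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| \leq \|\mathbb{P}_{\mathcal{Z}_{td}(\theta)} - \mathbb{P}_{\mathcal{Z}_{td}(\theta')}\|,$$

and therefore the result trivially holds.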

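Now suppose $k > d$. Fix any $\gamma > 0$, and let $B_{\theta,\theta'} \subseteq (\mathcal{X} \times \mathbb{R})^k$ be a measurable set such that

$$\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_{\theta,\theta'}) \leq \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\|$$
$$\leq \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_{\theta,\theta'}) + \gamma.$$

By Carathéodory's extension theorem, there exists a disjoint sequence of sets $\{B_i(\theta, \theta')\}_{i=1}^{\infty}$ such that

$$\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_{\theta,\theta'}) < \gamma + \sum_{i=1}^{\infty} \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_i(\theta, \theta')) - \sum_{i=1}^{\infty} \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_i(\theta, \theta')),$$

and such that each $B_i(\theta, \theta')$ is representable as follows; for some $\ell_i(\theta, \theta') \in \mathbb{N}$, and sets $C_{ij} =
(A_{ij1} \times (-\infty, t_{ij1}]) \times \cdots \times (A_{ijk} \times (-\infty, t_{ijk}])$, for $j \leq \ell_i(\theta, \theta')$, where each $A_{ijp} \in \mathcal{B}_{\mathcal{X}}$,
the set $B_i(\theta, \theta')$ is representable as $\bigcup_{s \in S_i} \bigcap_{j=1}^{\ell_i(\theta,\theta')} D_{ijs}$, where $S_i \subseteq \{0, \ldots, 2^{\ell_i(\theta,\theta')} - 1\}$; each
$D_{ijs} \in \{C_{ij}, C_{ij}^c\}$, and $s \neq s' \implies \bigcap_{j=1}^{\ell_i(\theta,\theta')} D_{ijs} \cap \bigcap_{j=1}^{\ell_i(\theta,\theta')} D_{ijs'} = \emptyset$. Since the $B_i(\theta, \theta')$ are
disjoint, the above sums are bounded, so that there exists $m_k(\theta, \theta', \gamma) \in \mathbb{N}$ such that every
$m \geq m_k(\theta, \theta', \gamma)$ has

$$\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_{\theta,\theta'}) < 2\gamma + \sum_{i=1}^{m} \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_i(\theta, \theta')) - \sum_{i=1}^{m} \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_i(\theta, \theta')).$$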

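Now define $\tilde{M}_k(\gamma) = \max_{\theta, \theta' \in \Theta_\gamma} m_k(\theta, \theta', \gamma)$. Then for any $\theta, \theta' \in \Theta$, let $\theta_\gamma, \theta'_\gamma \in \Theta_\gamma$ be
such that $\|\pi_\theta - \pi_{\theta_\gamma}\| < \gamma$ and $\|\pi_{\theta'} - \pi_{\theta'_\gamma}\| < \gamma$, which implies $\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)}\| < \gamma$ and
$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta')} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}\| < \gamma$ by Lemma 9.2. Then

$$\|\mathbb{P}_{\mathcal{Z}_{tk}(\theta)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\| < \|\mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)} - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}\| + 2\gamma$$
$$\leq \mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)}(B_{\theta_\gamma,\theta'_\gamma}) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}(B_{\theta_\gamma,\theta'_\gamma}) + 3\gamma$$
$$\leq \sum_{i=1}^{\tilde{M}_k(\gamma)} \left( \mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)}(B_i(\theta_\gamma, \theta'_\gamma)) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}(B_i(\theta_\gamma, \theta'_\gamma)) \right) + 5\gamma.$$

Again, since the $B_i(\theta_\gamma, \theta'_\gamma)$ are disjoint, this equals

$$5\gamma + \mathbb{P}_{\mathcal{Z}_{tk}(\theta_\gamma)}\left(\bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma, \theta'_\gamma)\right) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta'_\gamma)}\left(\bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma, \theta'_\gamma)\right)$$
$$\leq 7\gamma + \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}\left(\bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma, \theta'_\gamma)\right) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\left(\bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma, \theta'_\gamma)\right)$$
$$= 7\gamma + \sum_{i=1}^{\tilde{M}_k(\gamma)} \left( \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_i(\theta_\gamma, \theta'_\gamma)) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_i(\theta_\gamma, \theta'_\gamma)) \right)$$
$$\leq 7\gamma + \tilde{M}_k(\gamma) \max_{i \leq \tilde{M}_k(\gamma)} \left| \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_i(\theta_\gamma, \theta'_\gamma)) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_i(\theta_\gamma, \theta'_\gamma)) \right|.$$

Thus, if we can show that each $\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_i(\theta_\gamma, \theta'_\gamma)) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_i(\theta_\gamma, \theta'_\gamma))$ is bounded by a $o(1)$
function of $\|\mathbb{P}_{\mathcal{Z}_{td}(\theta)} - \mathbb{P}_{\mathcal{Z}_{td}(\theta')}\|$, then the result will follow by substituting this relaxation into the
above expression and defining $M_k$ by minimizing the resulting expression over $\gamma > 0$.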

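Toward this end, let $C_{ij}$ be as above from the definition of $B_i(\theta_\gamma, \theta'_\gamma)$, and note that $\mathbb{1}_{B_i(\theta_\gamma, \theta'_\gamma)}$
is representable as a function of the $\mathbb{1}_{C_{ij}}$ indicators, so that, writing $\ell_i = \ell_i(\theta_\gamma, \theta'_\gamma)$,

$$\left| \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(B_i(\theta_\gamma, \theta'_\gamma)) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(B_i(\theta_\gamma, \theta'_\gamma)) \right|$$
$$\leq 2^{\ell_i} \max_{J \subseteq \{1, \ldots, \ell_i\}} \left| \mathbb{E}\left[ \prod_{j \in J} \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta)) \prod_{j \notin J} \left(1 - \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta))\right) - \prod_{j \in J} \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta')) \prod_{j \notin J} \left(1 - \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta'))\right) \right] \right|$$
$$\leq 2^{\ell_i} \sum_{J \subseteq \{1, \ldots, \ell_i\}} \left| \mathbb{E}\left[ \prod_{j \in J} \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta)) - \prod_{j \in J} \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta')) \right] \right|$$
$$\leq 4^{\ell_i} \max_{J \subseteq \{1, \ldots, \ell_i\}} \left| \mathbb{E}\left[ \prod_{j \in J} \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta)) - \prod_{j \in J} \mathbb{1}_{C_{ij}}(\mathcal{Z}_{tk}(\theta')) \right] \right|$$
$$= 4^{\ell_i} \max_{J \subseteq \{1, \ldots, \ell_i\}} \left| \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}\left(\bigcap_{j \in J} C_{ij}\right) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}\left(\bigcap_{j \in J} C_{ij}\right) \right|.$$

Note that $\bigcap_{j \in J} C_{ij}$ can be expressed as some $(A_1 \times (-\infty, t_1]) \times \cdots \times (A_k \times (-\infty, t_k])$, where
each $A_p \in \mathcal{B}_{\mathcal{X}}$ and $t_p \in \mathbb{R}$, so that, letting $\hat\ell = \max_{\theta, \theta' \in \Theta_\gamma} \max_{i \leq \tilde{M}_k(\gamma)} \ell_i(\theta, \theta')$ and $\mathcal{C}_k = \{(A_1 \times
(-\infty, t_1]) \times \cdots \times (A_k \times (-\infty, t_k]) : \forall j \leq k, A_j \in \mathcal{B}_{\mathcal{X}}, t_j \in \mathbb{R}\}$, this last expression is at most

$$4^{\hat\ell} \sup_{C \in \mathcal{C}_k} \left| \mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(C) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(C) \right|.$$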

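Next note that for any $C = (A_1 \times (-\infty, t_1]) \times \cdots \times (A_k \times (-\infty, t_k]) \in \mathcal{C}_k$, letting $C_1 =
A_1 \times \cdots \times A_k$ and $C_2 = (-\infty, t_1] \times \cdots \times (-\infty, t_k]$,

$$\mathbb{P}_{\mathcal{Z}_{tk}(\theta)}(C) - \mathbb{P}_{\mathcal{Z}_{tk}(\theta')}(C) = \mathbb{E}\left[ \left( \mathbb{P}_{\mathbb{Y}_{tk}(\theta) | \mathbb{X}_{tk}}(C_2) - \mathbb{P}_{\mathbb{Y}_{tk}(\theta') | \mathbb{X}_{tk}}(C_2) \right) \mathbb{1}_{C_1}(\mathbb{X}_{tk}) \right]$$
$$\leq \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_{tk}(\theta) | \mathbb{X}_{tk}}(C_2) - \mathbb{P}_{\mathbb{Y}_{tk}(\theta') | \mathbb{X}_{tk}}(C_2) \right| \right].$$

For $p \in \{1, \ldots, k\}$, let $C_{2p} = (-\infty, t_p]$. Then note that, by definition of $d$, for any given
$x = (x_1, \ldots, x_k)$, the class $\mathcal{H}_x = \{x_p \mapsto \mathbb{1}_{C_{2p}}(h(x_p)) : h \in \mathcal{F}\}$ is a VC class over $\{x_1, \ldots, x_k\}$
with VC dimension at most $d$. Furthermore, we have

$$\left| \mathbb{P}_{\mathbb{Y}_{tk}(\theta) | \mathbb{X}_{tk}}(C_2) - \mathbb{P}_{\mathbb{Y}_{tk}(\theta') | \mathbb{X}_{tk}}(C_2) \right|$$
$$= \left| \mathbb{P}_{(\mathbb{1}_{C_{21}}(h^*_{t\theta}(X_{t1})), \ldots, \mathbb{1}_{C_{2k}}(h^*_{t\theta}(X_{tk}))) | \mathbb{X}_{tk}}(\{(1, \ldots, 1)\}) - \mathbb{P}_{(\mathbb{1}_{C_{21}}(h^*_{t\theta'}(X_{t1})), \ldots, \mathbb{1}_{C_{2k}}(h^*_{t\theta'}(X_{tk}))) | \mathbb{X}_{tk}}(\{(1, \ldots, 1)\}) \right|.$$

Therefore, the results of [Yang, Hanneke, and Carbonell, 2013] (in the proof of their Lemma 3)
imply that

$$\left| \mathbb{P}_{\mathbb{Y}_{tk}(\theta) | \mathbb{X}_{tk}}(C_2) - \mathbb{P}_{\mathbb{Y}_{tk}(\theta') | \mathbb{X}_{tk}}(C_2) \right|$$
$$\leq 2^k \max_{y \in \{0,1\}^d} \max_{D \in \{1, \ldots, k\}^d} \left| \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \right|.$$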


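Thus, we have

$$\mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_{tk}(\theta) | \mathbb{X}_{tk}}(C_2) - \mathbb{P}_{\mathbb{Y}_{tk}(\theta') | \mathbb{X}_{tk}}(C_2) \right| \right]$$
$$\leq 2^k \, \mathbb{E}\left[ \max_{y \in \{0,1\}^d} \max_{D \in \{1, \ldots, k\}^d} \left| \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \right| \right]$$
$$\leq 2^k \sum_{y \in \{0,1\}^d} \sum_{D \in \{1, \ldots, k\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \right| \right]$$
$$\leq 2^{d+k} k^d \max_{y \in \{0,1\}^d} \max_{D \in \{1, \ldots, k\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \right| \right].$$

Exchangeability implies this is at most

$$2^{d+k} k^d \max_{y \in \{0,1\}^d} \sup_{t_1, \ldots, t_d \in \mathbb{R}} \mathbb{E}\left[ \left| \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(h^*_{t\theta}(X_{tj}))\}_{j=1}^{d} | \mathbb{X}_{td}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(h^*_{t\theta'}(X_{tj}))\}_{j=1}^{d} | \mathbb{X}_{td}}(\{y\}) \right| \right]$$
$$= 2^{d+k} k^d \max_{y \in \{0,1\}^d} \sup_{t_1, \ldots, t_d \in \mathbb{R}} \mathbb{E}\left[ \left| \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta))\}_{j=1}^{d} | \mathbb{X}_{td}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta'))\}_{j=1}^{d} | \mathbb{X}_{td}}(\{y\}) \right| \right].$$

[Yang, Hanneke, and Carbonell, 2013] argue that for all $y \in \{0,1\}^d$ and $t_1, \ldots, t_d \in \mathbb{R}$,

$$\mathbb{E}\left[ \left| \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta))\}_{j=1}^{d} | \mathbb{X}_{td}}(\{y\}) - \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta'))\}_{j=1}^{d} | \mathbb{X}_{td}}(\{y\}) \right| \right]$$
$$\leq 4 \sqrt{ \left\| \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta))\}_{j=1}^{d}, \mathbb{X}_{td}} - \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta'))\}_{j=1}^{d}, \mathbb{X}_{td}} \right\| }.$$

Noting that

$$\left\| \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta))\}_{j=1}^{d}, \mathbb{X}_{td}} - \mathbb{P}_{\{\mathbb{1}_{(-\infty, t_j]}(Y_{tj}(\theta'))\}_{j=1}^{d}, \mathbb{X}_{td}} \right\| \leq \|\mathbb{P}_{\mathcal{Z}_{td}(\theta)} - \mathbb{P}_{\mathcal{Z}_{td}(\theta')}\|$$

completes the proof. □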

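We can use the above lemmas to design an estimator of $\pi_{\theta_\star}$. Specifically, we have the following
result.

Theorem 9.4. There exists an estimator $\hat\theta_{T\theta_\star} = \hat\theta_T(\mathcal{Z}_{1d}(\theta_\star), \ldots, \mathcal{Z}_{Td}(\theta_\star))$, and functions $R :
\mathbb{N}_0 \times (0, 1] \to [0, \infty)$ and $\delta : \mathbb{N}_0 \times (0, 1] \to [0, 1]$ such that, for any $\alpha > 0$, $\lim_{T \to \infty} R(T, \alpha) =
\lim_{T \to \infty} \delta(T, \alpha) = 0$ and for any $T \in \mathbb{N}_0$ and $\theta_\star \in \Theta$,

$$\mathbb{P}\left( \|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| > R(T, \alpha) \right) \leq \delta(T, \alpha) \leq \alpha.$$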

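Proof. The estimator $\hat\theta_{T\theta_\star}$ we will use is precisely the minimum-distance skeleton estimate of
$\mathbb{P}_{\mathcal{Z}_{td}(\theta_\star)}$ [Devroye and Lugosi, 2001, Yatracos, 1985]. [Yatracos, 1985] proved that if $N(\varepsilon)$ is
the $\varepsilon$-covering number of $\{\mathbb{P}_{\mathcal{Z}_{td}(\theta)} : \theta \in \Theta\}$, then using this $\hat\theta_{T\theta_\star}$ estimator, for some
$T_\varepsilon = O((1/\varepsilon^2) \log N(\varepsilon/4))$, any $T \geq T_\varepsilon$ has

$$\mathbb{E}\left[ \|\mathbb{P}_{\mathcal{Z}_{td}(\hat\theta_{T\theta_\star})} - \mathbb{P}_{\mathcal{Z}_{td}(\theta_\star)}\| \right] < \varepsilon.$$

Thus, taking $G_T = \inf\{\varepsilon > 0 : T \geq T_\varepsilon\}$, we have

$$\mathbb{E}\left[ \|\mathbb{P}_{\mathcal{Z}_{td}(\hat\theta_{T\theta_\star})} - \mathbb{P}_{\mathcal{Z}_{td}(\theta_\star)}\| \right] \leq G_T = o(1).$$

Letting $R'(T, \alpha)$ be any positive sequence with $G_T \ll R'(T, \alpha) \ll 1$ and $R'(T, \alpha) \geq G_T/\alpha$, and
letting $\delta(T, \alpha) = G_T / R'(T, \alpha) = o(1)$, Markov's inequality implies

$$\mathbb{P}\left( \|\mathbb{P}_{\mathcal{Z}_{td}(\hat\theta_{T\theta_\star})} - \mathbb{P}_{\mathcal{Z}_{td}(\theta_\star)}\| > R'(T, \alpha) \right) \leq \delta(T, \alpha) \leq \alpha. \qquad (9.1)$$

Letting $R(T, \alpha) = \min_k \left( M_k(R'(T, \alpha)) + r_k \right)$, since $M_k(R'(T, \alpha)) = o(1)$ and $r_k = o(1)$, we have
$R(T, \alpha) = o(1)$. Furthermore, composing (9.1) with Lemmas 9.1, 9.2, and 9.3, we have

$$\mathbb{P}\left( \|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| > R(T, \alpha) \right) \leq \delta(T, \alpha) \leq \alpha.$$

□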
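The proof only requires some sequence $R'(T, \alpha)$ that dominates $G_T$, still vanishes, and is at least $G_T/\alpha$. The Python sketch below shows one admissible choice, $R' = \max(\sqrt{G_T}, G_T/\alpha)$, which is an assumption made here for illustration and not a choice stated in the text; the function name and the sample values of $G_T$ are likewise hypothetical. With this choice $\delta = G_T/R' = \min(\sqrt{G_T}, \alpha)$, so $\delta \leq \alpha$ and both $R'$ and $\delta$ tend to 0 as $G_T \to 0$.

```python
import math


def rate_and_confidence(g_t, alpha):
    """One admissible (R', delta) pair for the Markov step, given G_T > 0.

    Assumed choice: R' = max(sqrt(G_T), G_T / alpha), delta = G_T / R'.
    """
    if g_t <= 0.0 or not (0.0 < alpha <= 1.0):
        raise ValueError("need G_T > 0 and alpha in (0, 1]")
    r_prime = max(math.sqrt(g_t), g_t / alpha)
    delta = g_t / r_prime  # equals min(sqrt(G_T), alpha) up to rounding
    return r_prime, delta


if __name__ == "__main__":
    alpha = 0.05
    for g_t in (1e-1, 1e-2, 1e-4, 1e-8):  # hypothetical values of G_T
        print(g_t, rate_and_confidence(g_t, alpha))
```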


Remark:  Although  the  above  result  makes  use  of  the  minimum-distance  skeleton  estimator, 
which  is  typically  not  computationally  efficient,  it  is  often  possible  to  achieve  this  same  result 
(for  certain  families  of  distributions)  using  a  simpler  estimator,  such  as  the  maximum  likelihood 
estimator. All we require is that the risk of the estimator converges to 0 at a known rate that
is independent of $\theta_\star$. For instance, see [van de Geer, 2000b] for conditions on the family of
distributions  sufficient  for  this  to  be  true  of  the  maximum  likelihood  estimator. 


9.3  Maximizing Customer Satisfaction in Combinatorial Auctions

We can use Theorem 9.4 in the context of various applications. For instance, consider the
following application to the problem of serving a sequence of customers so as to maximize their
satisfaction.


Suppose there is a menu of $n$ items $[n] = \{1, \ldots, n\}$, and each bundle $B \subseteq [n]$ has an
associated price $p(B) \geq 0$. Suppose also there is a sequence of customers, each with a valuation
function $v_t : 2^{[n]} \to \mathbb{R}$. We suppose these $v_t$ functions are i.i.d. samples. We can then calculate
the satisfaction function for each customer as $s_t(x)$, where $x \in \{0, 1\}^n$, and $s_t(x) = v_t(B_x) -
p(B_x)$, where $B_x \subseteq [n]$ contains element $i \in [n]$ iff $x_i = 1$.
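The encoding of bundles as bit vectors can be sketched in a few lines of Python. This is only an illustration of the definitions $B_x$ and $s_t(x) = v_t(B_x) - p(B_x)$; the function names, the additive valuation, and the per-item price used in the example are all hypothetical.

```python
def bundle_from_bits(x):
    """B_x: the set of items i in {1, ..., n} with x_i = 1."""
    return frozenset(i + 1 for i, bit in enumerate(x) if bit == 1)


def satisfaction(valuation, price, x):
    """s_t(x) = v_t(B_x) - p(B_x) for a bit vector x in {0,1}^n."""
    bundle = bundle_from_bits(x)
    return valuation(bundle) - price(bundle)


if __name__ == "__main__":
    # Hypothetical customer: additive valuation over three items.
    item_values = {1: 3.0, 2: 0.5, 3: 2.0}

    def valuation(bundle):
        return sum(item_values[i] for i in bundle)

    def price(bundle):
        return 1.0 * len(bundle)  # hypothetical price: one unit per item

    print(satisfaction(valuation, price, (1, 0, 1)))
```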

Now suppose we are able to ask each customer a number of questions before serving up a
bundle $B_{x_t}$ to that customer. More specifically, we are able to ask for the value $s_t(x)$ for any
$x \in \{0, 1\}^n$. This is referred to as a value query in the literature on preference elicitation in
combinatorial auctions (see Chapter 14 of [Cramton, Shoham, and Steinberg, 2006], [Zinkevich,
Blum, and Sandholm, 2003]). We are interested in asking as few questions as possible, while
satisfying the guarantee that $\mathbb{E}[\max_x s_t(x) - s_t(x_t)] \leq \varepsilon$.

Now suppose, for every $\pi$ and $\varepsilon$, we have a method $A(\pi, \varepsilon)$ such that, given that $\pi$ is the actual
distribution of the $s_t$ functions, $A(\pi, \varepsilon)$ guarantees that the $x_t$ value it selects has $\mathbb{E}[\max_x s_t(x) -
s_t(x_t)] \leq \varepsilon$; also let $\hat{N}_t(\pi, \varepsilon)$ denote the actual (random) number of queries the method $A(\pi, \varepsilon)$
would ask for the $s_t$ function, and let $Q(\pi, \varepsilon) = \mathbb{E}[\hat{N}_t(\pi, \varepsilon)]$. We suppose the method never
queries any $s_t(x)$ value twice for a given $t$, so that its number of queries for any given $t$ is
bounded.
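The assumption that no value is queried twice can be enforced mechanically by caching answers. The Python sketch below is one hypothetical way to do this; the class name and interface are not from the text. It wraps a customer's satisfaction function, counts only distinct queries, and therefore never reports more than $2^n$ queries for a customer.

```python
class ValueQueryOracle:
    """Answers value queries s_t(x) and counts each distinct x only once."""

    def __init__(self, s_t):
        self._s_t = s_t    # the customer's satisfaction function
        self._cache = {}   # answers to queries already asked

    def query(self, x):
        x = tuple(x)
        if x not in self._cache:
            self._cache[x] = self._s_t(x)
        return self._cache[x]

    @property
    def num_queries(self):
        return len(self._cache)


if __name__ == "__main__":
    oracle = ValueQueryOracle(lambda x: sum(x) - 0.5)  # hypothetical s_t
    oracle.query((1, 0))
    oracle.query((1, 0))  # repeated query, answered from the cache
    oracle.query((1, 1))
    print(oracle.num_queries)
```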

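Also suppose $\mathcal{F}$ is a VC subgraph class of functions mapping $\mathcal{X} = \{0, 1\}^n$ into $[-1, 1]$ with
pseudodimension $d$, and that $\{\pi_\theta : \theta \in \Theta\}$ is a known totally bounded family of distributions
over $\mathcal{F}$ such that the $s_t$ functions have distribution $\pi_{\theta_\star}$ for some unknown $\theta_\star \in \Theta$. For any $\theta \in \Theta$
and $\gamma > 0$, let $\mathrm{B}(\theta, \gamma) = \{\theta' \in \Theta : \|\pi_\theta - \pi_{\theta'}\| \leq \gamma\}$.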

Suppose, in addition to $A$, we have another method $A'(\varepsilon)$ that is not $\pi$-dependent, but still
provides the $\varepsilon$-correctness guarantee, and makes a bounded number of queries (e.g., in the
worst case, we could consider querying all $2^n$ points, but in most cases there are more clever
$\pi$-independent methods that use far fewer queries, such as $O(1/\varepsilon^2)$). Consider the following
method; the quantities $\hat\theta_{T\theta_\star}$, $R(T, \alpha)$, and $\delta(T, \alpha)$ from Theorem 9.4 are here considered with
respect to $P_X$ taken as the uniform distribution on $\{0, 1\}^n$.

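Algorithm 2 An algorithm for sequentially maximizing expected customer satisfaction.
for $t = 1, 2, \ldots, T$ do
    Pick points $X_{t1}, X_{t2}, \ldots, X_{td}$ uniformly at random from $\{0, 1\}^n$
    if $R(t - 1, \varepsilon/2) > \varepsilon/8$ then
        Run $A'(\varepsilon)$
        Take $x_t$ as the returned value
    else
        Let $\check\theta_{t\theta_\star} \in \mathrm{B}\left(\hat\theta_{(t-1)\theta_\star}, R(t - 1, \varepsilon/2)\right)$ be such that
            $Q(\pi_{\check\theta_{t\theta_\star}}, \varepsilon/4) \leq \min_{\theta \in \mathrm{B}(\hat\theta_{(t-1)\theta_\star}, R(t-1, \varepsilon/2))} Q(\pi_\theta, \varepsilon/4) + 1/t$
        Run $A(\pi_{\check\theta_{t\theta_\star}}, \varepsilon/4)$ and let $x_t$ be its return value
    end if
end for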

The following theorem indicates that this method is correct, and furthermore that the long-run
average number of queries is not much worse than that of a method that has direct knowledge
of $\pi_{\theta_\star}$.

Theorem 9.5. For the above method, $\forall t \leq T$, $\mathbb{E}[\max_x s_t(x) - s_t(x_t)] \leq \varepsilon$. Furthermore, if $S_T(\varepsilon)$
is the total number of queries made by the method, then

$$\limsup_{T \to \infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \leq Q(\pi_{\theta_\star}, \varepsilon/4) + d.$$

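Proof. By Theorem 9.4, for any $t \leq T$, if $R(t - 1, \varepsilon/2) \leq \varepsilon/8$, then with probability at least
$1 - \varepsilon/2$, $\|\pi_{\theta_\star} - \pi_{\hat\theta_{(t-1)\theta_\star}}\| \leq R(t - 1, \varepsilon/2)$, so that a triangle inequality implies $\|\pi_{\theta_\star} - \pi_{\check\theta_{t\theta_\star}}\| \leq
2R(t - 1, \varepsilon/2) \leq \varepsilon/4$. Thus,

$$\mathbb{E}\left[ \max_x s_t(x) - s_t(x_t) \right] \leq \mathbb{E}\left[ \mathbb{E}\left[ \max_x s_t(x) - s_t(x_t) \,\middle|\, \check\theta_{t\theta_\star} \right] \mathbb{1}\left[ \|\pi_{\check\theta_{t\theta_\star}} - \pi_{\theta_\star}\| \leq \varepsilon/4 \right] \right] + \varepsilon/2. \qquad (9.2)$$

For $\theta \in \Theta$, let $x_{t\theta}$ denote the point $x$ that would be returned by $A(\pi_{\check\theta_{t\theta_\star}}, \varepsilon/4)$ when queries are
answered by some $s_{t\theta} \sim \pi_\theta$ instead of $s_t$ (and supposing $s_t = s_{t\theta_\star}$). If $\|\pi_{\check\theta_{t\theta_\star}} - \pi_{\theta_\star}\| \leq \varepsilon/4$, then

$$\mathbb{E}\left[ \max_x s_t(x) - s_t(x_t) \,\middle|\, \check\theta_{t\theta_\star} \right] = \mathbb{E}\left[ \max_x s_{t\theta_\star}(x) - s_{t\theta_\star}(x_{t\theta_\star}) \,\middle|\, \check\theta_{t\theta_\star} \right]$$
$$\leq \mathbb{E}\left[ \max_x s_{t\check\theta_{t\theta_\star}}(x) - s_{t\check\theta_{t\theta_\star}}(x_{t\check\theta_{t\theta_\star}}) \,\middle|\, \check\theta_{t\theta_\star} \right] + \|\pi_{\check\theta_{t\theta_\star}} - \pi_{\theta_\star}\| \leq \varepsilon/4 + \varepsilon/4 = \varepsilon/2.$$

Plugging into (9.2), we have

$$\mathbb{E}\left[ \max_x s_t(x) - s_t(x_t) \right] \leq \varepsilon.$$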

For the result on $S_T(\epsilon)$, first note that $R(t-1, \epsilon/2) > \epsilon/8$ only finitely many times (due
to $R(t, \alpha) = o(1)$), so that we can ignore those values of $t$ in the asymptotic calculation (as the
number of queries is always bounded), and rely on the correctness guarantee of $A'$ for correctness.
For the remaining $t$ values, let $N_t$ denote the number of queries made by $A(\pi_{\check{\theta}_{t\theta_\star}}, \epsilon/4)$.
Then

$$\limsup_{T \to \infty} \frac{E[S_T(\epsilon)]}{T} \le d + \limsup_{T \to \infty} \sum_{t=1}^{T} E[N_t]/T.$$


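Since

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[N_t 1\left[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| > R(t-1, \epsilon/2)\right]\right]$$
$$\le \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} 2^n P\left(\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| > R(t-1, \epsilon/2)\right) \le 2^n \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \delta(t-1, \epsilon/2) = 0,$$

we have

$$\limsup_{T \to \infty} \sum_{t=1}^{T} E[N_t]/T = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[N_t 1\left[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1, \epsilon/2)\right]\right].$$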


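For any $t \le T$, let $N_t(\check{\theta}_{t\theta_\star})$ denote the number of queries $A(\pi_{\check{\theta}_{t\theta_\star}}, \epsilon/4)$ would make if queries
were answered with $s_{t\check{\theta}_{t\theta_\star}}$ instead of $s_t$. On the event $\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1, \epsilon/2)$, we have

$$E\left[N_t \,\Big|\, \check{\theta}_{t\theta_\star}\right] \le E\left[N_t(\check{\theta}_{t\theta_\star}) \,\Big|\, \check{\theta}_{t\theta_\star}\right] + 2R(t-1, \epsilon/2)$$
$$= Q(\pi_{\check{\theta}_{t\theta_\star}}, \epsilon/4) + 2R(t-1, \epsilon/2) \le Q(\pi_{\theta_\star}, \epsilon/4) + 2R(t-1, \epsilon/2) + 1/t.$$

Therefore,

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[N_t 1\left[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1, \epsilon/2)\right]\right]$$
$$\le Q(\pi_{\theta_\star}, \epsilon/4) + \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left(2R(t-1, \epsilon/2) + 1/t\right) = Q(\pi_{\theta_\star}, \epsilon/4).$$

□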


Note that in many cases, this result will even continue to hold with an infinite number of
goods ($n = \infty$), since the general results of the previous section have no dependence on the
cardinality of the space X.

Chapter 10


Active  Learning  with  a  Drifting 
Distribution 


Abstract 

We  study  the  problem  of  active  learning  in  a  stream-based  setting,  allowing  the  distribution  of 
the  examples  to  change  over  time.  We  prove  upper  bounds  on  the  number  of  prediction  mistakes 
and  number  of  label  requests  for  established  disagreement-based  active  learning  algorithms,  both 
in  the  realizable  case  and  under  Tsybakov  noise.  We  further  prove  minimax  lower  bounds  for 
this  problem. 


10.1  Introduction 

Most  existing  analyses  of  active  learning  are  based  on  an  i.i.d.  assumption  on  the  data.  In  this 
work,  we  assume  the  data  are  independent,  but  we  allow  the  distribution  from  which  the  data 
are  drawn  to  shift  over  time,  while  the  target  concept  remains  fixed.  We  consider  this  problem 
in  a  stream-based  selective  sampling  model,  and  are  interested  in  two  quantities:  the  number  of 
mistakes the algorithm makes on the first T examples in the stream, and the number of label
requests among the first T examples in the stream.

In  particular,  we  study  scenarios  in  which  the  distribution  may  drift  within  a  fixed  totally 
bounded family of distributions. Unlike previous models of distribution drift [Bartlett, 1992,
Koby Crammer and Vaughan, 2010], the minimax number of mistakes (or excess number of
mistakes,  in  the  noisy  case)  can  be  sublinear  in  the  number  of  samples. 

We  specifically  study  the  classic  CAL  active  learning  strategy  [Cohn,  Atlas,  and  Ladner, 
1994b]  in  this  context,  and  bound  the  number  of  mistakes  and  label  requests  the  algorithm  makes 
in the realizable case, under conditions on the concept space and the family of possible distributions.
We also exhibit lower bounds on these quantities that match our upper bounds in certain
cases. We further study a noise-robust variant of CAL, and analyze its number of mistakes and
number of label requests in noisy scenarios where the noise distribution remains fixed over time
but the marginal distribution on X may shift. In particular, we upper bound these quantities under
Tsybakov's noise conditions [Mammen and Tsybakov, 1999]. We also prove minimax lower
bounds  under  these  same  conditions,  though  there  is  a  gap  between  our  upper  and  lower  bounds. 

10.2  Definition  and  Notations 

As in the usual statistical learning problem, there is a standard Borel space X, called the instance
space, and a set C of measurable classifiers $h : X \to \{-1, +1\}$, called the concept space. We
additionally have a space D of distributions on X, called the distribution space. Throughout, we
suppose that the VC dimension of C, denoted d below, is finite.

For any $\mu_1, \mu_2 \in D$, let $\|\mu_1 - \mu_2\| = \sup_A \mu_1(A) - \mu_2(A)$ denote the total variation pseudo-distance
between $\mu_1$ and $\mu_2$, where the set $A$ in the sup ranges over all measurable subsets of X.
For any $\epsilon > 0$, let $D_\epsilon$ denote a minimal $\epsilon$-cover of D, meaning that $D_\epsilon \subseteq D$ and $\forall \mu_1 \in D$, $\exists \mu_2 \in D_\epsilon$
s.t. $\|\mu_1 - \mu_2\| < \epsilon$, and that $D_\epsilon$ has minimal possible size $|D_\epsilon|$ among all subsets of D with
this property.
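
The two notions above can be illustrated on finite supports. The following is a minimal sketch, assuming distributions are represented as Python dictionaries mapping points to probabilities; the names `tv_distance` and `greedy_cover` are hypothetical, and the greedy procedure returns a valid $\epsilon$-cover but not necessarily a minimal one.

```python
def tv_distance(mu1, mu2):
    """Total variation distance sup_A mu1(A) - mu2(A) for finitely supported distributions."""
    support = set(mu1) | set(mu2)
    # The sup over sets A is attained at A = {x : mu1(x) > mu2(x)}.
    return sum(max(mu1.get(x, 0.0) - mu2.get(x, 0.0), 0.0) for x in support)


def greedy_cover(family, eps):
    """Return a sub-family such that every member of `family` is within eps of it.

    Greedy construction: a valid eps-cover, though not guaranteed to be minimal.
    """
    cover = []
    for mu in family:
        # Keep mu only if no distribution already chosen is strictly within eps of it.
        if all(tv_distance(mu, nu) >= eps for nu in cover):
            cover.append(mu)
    return cover


# Illustration: Bernoulli(p) distributions for p = 0, 0.1, ..., 1.
family = [{0: 1 - i / 10, 1: i / 10} for i in range(11)]
cover = greedy_cover(family, 0.25)
```

For Bernoulli distributions the distance reduces to the difference of the two parameters, so the size of the cover grows roughly like $1/\epsilon$ in this toy family.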

In the learning problem, there is an unobservable sequence of distributions $\mathcal{D}_1, \mathcal{D}_2, \ldots$, with
each $\mathcal{D}_t \in D$, and an unobservable time-independent regular conditional distribution, which
we represent by a function $\eta : X \to [0, 1]$. Based on these quantities, we let $Z = \{(X_t, Y_t)\}_{t=1}^{\infty}$
denote an infinite sequence of independent random variables, such that $\forall t$, $X_t \sim \mathcal{D}_t$, and the conditional
distribution of $Y_t$ given $X_t$ satisfies $\forall x \in X$, $P(Y_t = +1 | X_t = x) = \eta(x)$. Thus, the joint
distribution of $(X_t, Y_t)$ is specified by the pair $(\mathcal{D}_t, \eta)$, and the distribution of $Z$ is specified by
the collection $\{\mathcal{D}_t\}_{t=1}^{\infty}$ along with $\eta$. We also denote by $Z_t = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_t, Y_t)\}$
the first $t$ such labeled examples. Note that the $\eta$ conditional distribution is time-independent,
since we are restricting ourselves to discussing drifting marginal distributions on X, rather than
drifting concepts. Concept drift is an important and interesting topic, but is beyond the scope of
our present discussion.

In the active learning protocol, at each time $t$, the algorithm is presented with the value
$X_t$, and is required to predict a label $\hat{Y}_t \in \{-1, +1\}$; then after making this prediction, it may
optionally request to observe the true label value $Y_t$; as a means of book-keeping, if the algorithm
requests a label $Y_t$ on round $t$, we define $Q_t = 1$, and otherwise $Q_t = 0$.

We are primarily interested in two quantities. The first, $\hat{M}_T = \sum_{t=1}^{T} 1[\hat{Y}_t \ne Y_t]$, is the
cumulative number of mistakes up to time $T$. The second quantity of interest, $\hat{Q}_T = \sum_{t=1}^{T} Q_t$,
is the total number of labels requested up to time $T$. In particular, we will study the expectations
of these quantities: $\bar{M}_T = E[\hat{M}_T]$ and $\bar{Q}_T = E[\hat{Q}_T]$. We are particularly interested in the
asymptotic dependence of $\bar{Q}_T$ and $\bar{M}_T - M^*_T$ on $T$, where $M^*_T = \inf_{h \in C} E\left[\sum_{t=1}^{T} 1[h(X_t) \ne Y_t]\right]$.

We refer to $\bar{Q}_T$ as the expected number of label requests, and to $\bar{M}_T - M^*_T$ as the expected excess
number of mistakes. For any distribution $P$ on X, we define $er_P(h) = E_{X \sim P}[\eta(X) 1[h(X) = -1] + (1 - \eta(X)) 1[h(X) = +1]]$,
the probability of $h$ making a mistake for $X \sim P$ and $Y$ with
conditional probability of being $+1$ equal to $\eta(X)$. Note that, abbreviating $er_t(h) = er_{\mathcal{D}_t}(h) = P(h(X_t) \ne Y_t)$,
we have $M^*_T = \inf_{h \in C} \sum_{t=1}^{T} er_t(h)$.
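
As a small numerical illustration of $er_P(h)$ and of the benchmark $M^*_T$, the following sketch assumes a finite instance space, with each distribution and $\eta$ stored as dictionaries and the concept space given as a finite list of classifiers; the function names `er` and `best_expected_mistakes` are hypothetical.

```python
def er(P, eta, h):
    """er_P(h): probability that h(X) != Y when X ~ P and P(Y = +1 | X = x) = eta[x]."""
    # If h predicts -1, the mistake probability at x is eta[x]; otherwise it is 1 - eta[x].
    return sum(P[x] * (eta[x] if h(x) == -1 else 1.0 - eta[x]) for x in P)


def best_expected_mistakes(dists, eta, classifiers):
    """M*_T: the infimum over the class of the summed error rates er_t(h), t = 1..T."""
    return min(sum(er(P, eta, h) for P in dists) for h in classifiers)


# Illustration on X = {0, 1, 2, 3} with threshold classifiers and a marginal that drifts once.
eta = {0: 0.1, 1: 0.2, 2: 0.8, 3: 0.9}
uniform = {x: 0.25 for x in range(4)}
shifted = {0: 0.5, 1: 0.5, 2: 0.0, 3: 0.0}
thresholds = [lambda x, th=th: 1 if x >= th else -1 for th in range(5)]
m_star = best_expected_mistakes([uniform, shifted], eta, thresholds)
```

Here the threshold at 2 agrees with sign($\eta$ - 1/2) everywhere, so it attains the minimum under both marginals even though the marginal changes between rounds.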

Scenarios in which both $\bar{M}_T - M^*_T$ and $\bar{Q}_T$ are $o(T)$ (i.e., sublinear) are considered desirable,
as these represent cases in which we do "learn" the proper way to predict labels, while asymptotically
using far fewer labels than passive learning. Having established conditions under which
this is possible, we may then further explore the trade-off between these two quantities.

We will additionally make use of the following notions. For $V \subseteq C$, let $diam_t(V) = \sup_{h,g \in V} \mathcal{D}_t(\{x : h(x) \ne g(x)\})$.
For $h : X \to \{-1, +1\}$, $\overline{er}_{s:t}(h) = \frac{1}{t-s+1} \sum_{u=s}^{t} er_u(h)$,
and for finite $S \subseteq X \times \{-1, +1\}$, $er(h; S) = \frac{1}{|S|} \sum_{(x,y) \in S} 1[h(x) \ne y]$. Also let $C[S] = \{h \in C : er(h; S) = 0\}$.
Finally, for a distribution $P$ on X and $r > 0$, define $B_P(h, r) = \{g \in C : P(x : h(x) \ne g(x)) \le r\}$.

10.2.1  Assumptions 

In addition to the assumption of independence of the $X_t$ variables and that $d < \infty$, each result
below is stated under various additional assumptions. The weakest such assumption is that D is
totally bounded, in the following sense. For each $\epsilon > 0$, let $D_\epsilon$ denote a minimal subset of D
such that $\forall \mathcal{D} \in D$, $\exists \mathcal{D}' \in D_\epsilon$ s.t. $\|\mathcal{D} - \mathcal{D}'\| < \epsilon$: that is, a minimal $\epsilon$-cover of D. We say that D
is totally bounded if it satisfies the following assumption.

Assumption 10.1. $\forall \epsilon > 0$, $|D_\epsilon| < \infty$.

In some of the results below, we will be interested in deriving specific rates of convergence.
Doing so requires us to make stronger assumptions about D than mere total boundedness. We
will specifically consider the following condition, in which $c, m \in [0, \infty)$ are constants.

Assumption 10.2. $\forall \epsilon > 0$, $|D_\epsilon| < c \cdot \epsilon^{-m}$.

For an example of a class D satisfying the total boundedness assumption, consider $X = [0, 1]^n$,
and let D be the collection of distributions that have uniformly continuous density function
with respect to the Lebesgue measure on X, with modulus of continuity at most some value
$\omega(\epsilon)$ for each value of $\epsilon > 0$, where $\omega(\epsilon)$ is a fixed real-valued function with $\lim_{\epsilon \to 0} \omega(\epsilon) = 0$.

As a more concrete example, when $\omega(\epsilon) = L\epsilon$ for some $L \in (0, \infty)$, this corresponds to the
family of Lipschitz continuous density functions with Lipschitz constant at most $L$. In this case,
we have $|D_\epsilon| \le O(\epsilon^{-n})$, satisfying Assumption 10.2.

10.3  Related Work


We  discuss  active  learning  under  distribution  drift,  with  fixed  target  concept.  There  are  several 
branches  of  the  literature  that  are  highly  relevant  to  this,  including  domain  adaptation  [Mansour, 
Mohri, and Rostamizadeh, 2008, 2009], online learning [Littlestone, 1988], learning with concept
drift, and empirical processes for independent but not identically distributed data [van de
Geer, 2000a].

Stream-based Active Learning with a Fixed Distribution  [Dasgupta, Kalai, and Monteleoni,
2009] show that a certain modified perceptron-like active learning algorithm can achieve
a mistake bound $O(d \log(T))$ and query bound $O(d \log(T))$, when learning a linear separator
under a uniform distribution on the unit sphere, in the realizable case. [Dekel, Gentile, and Sridharan,
2010] also analyze the problem of learning linear separators under a uniform distribution,
but allowing Tsybakov noise. They find that with $\bar{Q}_T = \tilde{O}\left(d^{\frac{2\alpha}{\alpha+2}} T^{\frac{2}{\alpha+2}}\right)$ queries, it is possible to
achieve an expected excess number of mistakes $\bar{M}_T - M^*_T = \tilde{O}\left(d^{\frac{\alpha+1}{\alpha+2}} \cdot T^{\frac{1}{\alpha+2}}\right)$. At this time, we
know of no work studying the number of mistakes and queries achievable by active learning in a
stream-based setting where the distribution may change over time.

Stream-based Passive Learning with a Drifting Distribution  There has been work on learning
with a drifting distribution and fixed target, in the context of passive learning. [Bartlett, 1992,
Barve and Long, 1997] study the problem of learning a subset of a domain from randomly chosen
examples when the probability distribution of the examples changes slowly but continually
throughout the learning process; they give upper and lower bounds on the best achievable probability
of misclassification after a given number of examples. They consider learning problems
in  which  a  changing  environment  is  modeled  by  a  slowly  changing  distribution  on  the  product 
space.  The  allowable  drift  is  restricted  by  ensuring  that  consecutive  probability  distributions  are 
close  in  total  variation  distance.  However,  this  assumption  allows  for  certain  malicious  choices  of 
distribution sequences, which shift the probability mass into smaller and smaller regions where
the algorithm is uncertain of the target's behavior, so that the number of mistakes grows linearly
in  the  number  of  samples  in  the  worst  case.  More  recently,  [Freund  and  Mansour,  1997]  have 
investigated  learning  when  the  distribution  changes  as  a  linear  function  of  time.  They  present 
algorithms  that  estimate  the  error  of  functions,  using  knowledge  of  this  linear  drift. 


10.4  Active  Learning  in  the  Realizable  Case 


Throughout this section, suppose C is a fixed concept space and $h^* \in C$ is a fixed target function:
that is, $er_t(h^*) = 0$. The family of scenarios in which this is true is often collectively referred
to as the realizable case. We begin our analysis by studying this realizable case because it
greatly simplifies the analysis, laying bare the core ideas in plain form. We will discuss more
general scenarios, in which $er_t(h^*) > 0$, in later sections, where we find that essentially the same
principles apply there as in this initial realizable-case analysis.

We will be particularly interested in the performance of the following simple algorithm, due
to [Cohn, Atlas, and Ladner, 1994b], typically referred to as CAL after its discoverers. The
version presented here is specified in terms of a passive learning subroutine $A$ (mapping any
sequence of labeled examples to a classifier). In it, we use the notation $DIS(V) = \{x \in X : \exists h, g \in V \text{ s.t. } h(x) \ne g(x)\}$,
also used below.

CAL

1. $t \leftarrow 0$, $\mathcal{Q}_0 \leftarrow \emptyset$, and let $\hat{h}_0 = A(\emptyset)$
2. Do
3.     $t \leftarrow t + 1$
4.     Predict $\hat{Y}_t = \hat{h}_{t-1}(X_t)$
5.     If $\max_{y \in \{-1,+1\}} \min_{h \in C} er(h; \mathcal{Q}_{t-1} \cup \{(X_t, y)\}) = 0$
6.         Request $Y_t$, let $\mathcal{Q}_t = \mathcal{Q}_{t-1} \cup \{(X_t, Y_t)\}$
7.     Else let $Y'_t = \mathrm{argmin}_{y \in \{-1,+1\}} \min_{h \in C} er(h; \mathcal{Q}_{t-1} \cup \{(X_t, y)\})$, and let
           $\mathcal{Q}_t \leftarrow \mathcal{Q}_{t-1} \cup \{(X_t, Y'_t)\}$
8.     Let $\hat{h}_t = A(\mathcal{Q}_t)$
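
To make the steps of CAL concrete, here is a minimal sketch for a finite class of threshold classifiers on [0, 1] in the realizable case. It is an illustration under stated assumptions rather than the analyzed procedure: the subroutine $A$ is taken to be "return any consistent classifier" instead of the one-inclusion graph strategy, the labeled set is tracked implicitly through the version space it induces, and the names `predict` and `run_cal` are hypothetical.

```python
import random


def predict(theta, x):
    """Threshold classifier h_theta(x) = +1 if x >= theta, else -1."""
    return 1 if x >= theta else -1


def run_cal(stream, target_theta, thetas):
    """Run CAL on `stream` with labels given by h_{target_theta}.

    Returns (mistakes, queries, version_space).
    """
    version = list(thetas)   # classifiers with zero error on the labels stored so far
    h = version[0]           # A(empty set): any classifier in the class
    mistakes = queries = 0
    for x in stream:
        y_true = predict(target_theta, x)
        if predict(h, x) != y_true:                     # Step 4: predict before seeing the label
            mistakes += 1
        feasible = {predict(th, x) for th in version}   # labels some consistent h can still produce
        if len(feasible) == 2:                          # Step 5: both labels remain consistent
            queries += 1                                # Step 6: request the true label
            y = y_true
        else:
            y = next(iter(feasible))                    # Step 7: the label is inferred for free
        version = [th for th in version if predict(th, x) == y]
        h = version[0]                                  # Step 8: any consistent classifier
    return mistakes, queries, version


random.seed(0)
thetas = [i / 50 for i in range(51)]
stream = [random.random() for _ in range(500)]
mistakes, queries, version = run_cal(stream, thetas[15], thetas)
```

Nothing in the loop depends on how the stream is generated, so the same sketch runs unchanged when the marginal distribution of the points drifts over time. Every mistake falls on a point where two consistent classifiers disagree, so each mistake coincides with a label request.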

Below, we let $A_{1IG}$ denote the one-inclusion graph prediction strategy of [Haussler, Littlestone,
and Warmuth, 1994b]. Specifically, the passive learning algorithm $A_{1IG}$ is specified as
follows. For a sequence of data points $U \in X^{t+1}$, the one-inclusion graph is a graph, where each
vertex represents a distinct labeling of $U$ that can be realized by some classifier in C, and two
vertices are adjacent if and only if their corresponding labelings for $U$ differ by exactly one label.
We use the one-inclusion graph to define a classifier based on $t$ training points as follows. Given
$t$ labeled data points $\mathcal{L} = \{(x_1, y_1), \ldots, (x_t, y_t)\}$, and one test point $x_{t+1}$ we are asked to predict
a label for, we first construct the one-inclusion graph on $U = \{x_1, \ldots, x_{t+1}\}$; we then orient the
graph (give each edge a unique direction) in a way that minimizes the maximum out-degree, and
breaks ties in a way that is invariant to permutations of the order of points in $U$; after orienting
the graph in this way, we examine the subset of vertices whose corresponding labeling of $U$ is
consistent with $\mathcal{L}$; if there is only one such vertex, then we predict for $x_{t+1}$ the corresponding
label from that vertex; otherwise, if there are two such vertices, then they are adjacent in the
one-inclusion graph, and we choose the one toward which the edge is directed and use the label
for $x_{t+1}$ in the corresponding labeling of $U$ as our prediction for the label of $x_{t+1}$. See [Haussler,
Littlestone, and Warmuth, 1994b] and subsequent work for detailed studies of the one-inclusion
graph prediction strategy.




10.4.1  Learning  with  a  Fixed  Distribution 

We  begin  the  discussion  with  the  simplest  case:  namely,  when  |D|  =  1. 

Definition 10.3. [Hanneke, 2007a, 2011] Define the disagreement coefficient of $h^*$ under a distribution
$P$ as

$$\theta_P(\epsilon) = \sup_{r > \epsilon} P(DIS(B_P(h^*, r)))/r.$$
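
For intuition, the disagreement coefficient can be computed exactly in a small discrete setting. The sketch below assumes X = {0, ..., n-1} with probability vector `p`, and threshold classifiers $h_\theta(x) = +1$ iff $x \ge \theta$ for $\theta \in \{0, \ldots, n\}$; the function name `disagreement_coefficient` is hypothetical. It relies on the fact that, on a finite space, the disagreement mass is a step function of $r$, so the supremum is attained either in the limit $r \downarrow \epsilon$ or at one of the finitely many distances exceeding $\epsilon$.

```python
from fractions import Fraction


def disagreement_coefficient(p, target, eps):
    """theta_P(eps) for thresholds on {0, ..., n-1} with target threshold `target`."""
    n = len(p)
    prefix = [0]
    for w in p:
        prefix.append(prefix[-1] + w)       # prefix[k] = P(X < k)

    def dist(th):
        # P(x : h_th(x) != h_target(x)) is the mass strictly between the two thresholds.
        lo, hi = min(th, target), max(th, target)
        return prefix[hi] - prefix[lo]

    def dis_mass(r):
        # The ball B_P(h*, r) is a contiguous range of thresholds containing the target.
        ball = [th for th in range(n + 1) if dist(th) <= r]
        # Two thresholds a < b disagree exactly on the points x with a <= x < b.
        return prefix[max(ball)] - prefix[min(ball)]

    radii = [eps] + [dist(th) for th in range(n + 1) if dist(th) > eps]
    return max(dis_mass(r) / r for r in radii)


uniform = [Fraction(1, 10)] * 10
theta_mid = disagreement_coefficient(uniform, 5, Fraction(1, 10))
theta_edge = disagreement_coefficient(uniform, 0, Fraction(1, 10))
```

Exact rational arithmetic is used so that the comparisons defining the ball are not affected by rounding. An interior target can be approached from both sides, so its disagreement region has about twice the mass of the ball radius; a target at the boundary can be approached from one side only.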

Theorem 10.4. For any distribution $P$ on X, if $D = \{P\}$, then running CAL with $A = A_{1IG}$
achieves expected mistake bound $\bar{M}_T = O(d \log(T))$ and expected query bound
$\bar{Q}_T = O(\theta_P(\epsilon_T) d \log^2(T))$, for $\epsilon_T = d \log(T)/T$.

For  completeness,  the  proof  is  included  in  the  supplemental  materials. 


10.4.2  Learning  with  a  Drifting  Distribution 

We now generalize the above results to any sequence of distributions from a totally bounded
space D. Throughout this section, let $\theta_D(\epsilon) = \sup_{P \in D} \theta_P(\epsilon)$.

First, we prove a basic result stating that CAL can achieve a sublinear number of mistakes,
and under conditions on the disagreement coefficient, also a sublinear number of queries.

Theorem 10.5. If D is totally bounded (Assumption 10.1), then CAL (with $A$ any empirical risk
minimization algorithm) achieves an expected mistake bound $\bar{M}_T = o(T)$, and if $\theta_D(\epsilon) = o(1/\epsilon)$,
then CAL makes an expected number of queries $\bar{Q}_T = o(T)$.


Proof. As mentioned, given that $er(h^*; \mathcal{Q}_{t-1}) = 0$, we have that $Y'_t$ in Step 7 must equal $h^*(X_t)$,
so that the invariant $er(h^*; \mathcal{Q}_t) = 0$ is maintained for all $t$ by induction. In particular, this implies
$\mathcal{Q}_t = Z_t$ for all $t$.

Fix any $\epsilon > 0$, and enumerate the elements of $D_\epsilon$ so that $D_\epsilon = \{P_1, P_2, \ldots, P_{|D_\epsilon|}\}$. For each
$t \in N$, let $k(t) = \mathrm{argmin}_{k \le |D_\epsilon|} \|P_k - \mathcal{D}_t\|$, breaking ties arbitrarily. Let

$L(\epsilon) =$

For each $i \le |D_\epsilon|$, if $k(t) = i$ for infinitely many $t \in N$, then let $T_i$ denote the smallest value of
$T$ such that $|\{t \le T : k(t) = i\}| = L(\epsilon)$. If $k(t) = i$ only finitely many times, then let $T_i$ denote
the largest index $t$ for which $k(t) = i$, or $T_i = 1$ if no such index $t$ exists.

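Let $T_\epsilon = \max_{i \le |D_\epsilon|} T_i$ and $V_\epsilon = C[Z_{T_\epsilon}]$. We have that $\forall t > T_\epsilon$, $diam_t(V_\epsilon) \le diam_{k(t)}(V_\epsilon) + \epsilon$.
For each $i$, let $\mathcal{L}_i$ be a sequence of $L(\epsilon)$ i.i.d. pairs $(X, Y)$ with $X \sim P_i$ and $Y = h^*(X)$, and let
$V_i = C[\mathcal{L}_i]$. Then $\forall t > T_\epsilon$,

$$E[diam_{k(t)}(V_\epsilon)] \le E[diam_{k(t)}(V_{k(t)})] + \sum_{s \le T_i : k(s) = k(t)} \|\mathcal{D}_s - P_{k(s)}\| \le E[diam_{k(t)}(V_{k(t)})] + L(\epsilon)\epsilon.$$

By classic results in the theory of PAC learning [Anthony and Bartlett, 1999, Vapnik, 1982] and
our choice of $L(\epsilon)$, $\forall t > T_\epsilon$, $E[diam_{k(t)}(V_{k(t)})] \le \sqrt{\epsilon}$.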

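Combining the above arguments,

$$E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right] \le T_\epsilon + \sum_{t=T_\epsilon+1}^{T} E[diam_t(V_\epsilon)] \le T_\epsilon + \epsilon T + \sum_{t=T_\epsilon+1}^{T} E[diam_{k(t)}(V_\epsilon)]$$
$$\le T_\epsilon + \epsilon T + L(\epsilon)\epsilon T + \sum_{t=T_\epsilon+1}^{T} E[diam_{k(t)}(V_{k(t)})] \le T_\epsilon + \epsilon T + L(\epsilon)\epsilon T + \sqrt{\epsilon}\, T.$$

Let $\epsilon_T$ be any nonincreasing sequence in $(0, 1)$ such that $1 \ll T_{\epsilon_T} \ll T$. Since $|D_\epsilon| < \infty$ for
all $\epsilon > 0$, we must have $\epsilon_T \to 0$. Thus, noting that $\lim_{\epsilon \to 0} L(\epsilon)\epsilon = 0$, we have

$$E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right] \le T_{\epsilon_T} + \epsilon_T T + L(\epsilon_T)\epsilon_T T + \sqrt{\epsilon_T}\, T \ll T. \quad (10.1)$$

The result on $\bar{M}_T$ now follows by noting that any $\hat{h}_{t-1} \in C[Z_{t-1}]$ has $er_t(\hat{h}_{t-1}) \le diam_t(C[Z_{t-1}])$, so

$$\bar{M}_T = E\left[\sum_{t=1}^{T} er_t(\hat{h}_{t-1})\right] \le E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right] \ll T.$$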

Similarly, for $r > 0$, we have

$$P(\text{Request } Y_t) = E[P(X_t \in DIS(C[Z_{t-1}]) | Z_{t-1})] \le E[P(X_t \in DIS(C[Z_{t-1}] \cup B_{\mathcal{D}_t}(h^*, r)))]$$
$$\le E[\theta_D(r) \cdot \max\{diam_t(C[Z_{t-1}]), r\}] \le \theta_D(r) \cdot r + \theta_D(r) \cdot E[diam_t(C[Z_{t-1}])].$$

Letting $r_T = T^{-1} E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right]$, we see that $r_T \to 0$ by (10.1), and since $\theta_D(\epsilon) = o(1/\epsilon)$,
we also have $\theta_D(r_T) r_T \to 0$, so that $\theta_D(r_T) r_T T \ll T$. Therefore, $\bar{Q}_T$ equals

$$\sum_{t=1}^{T} P(\text{Request } Y_t) \le \theta_D(r_T) \cdot r_T \cdot T + \theta_D(r_T) \cdot E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right] = 2\theta_D(r_T) \cdot r_T \cdot T \ll T. \quad □$$

We can also state a more specific result in the case when we have some more detailed information
on the sizes of the finite covers of D.

Theorem 10.6. If Assumption 10.2 is satisfied, then CAL (with $A$ any empirical risk minimization
algorithm) achieves an expected mistake bound $\bar{M}_T$ and expected number of queries $\bar{Q}_T$ such that
$\bar{M}_T = O\left(T^{\frac{m}{m+1}} d^{\frac{1}{m+1}} \log^2 T\right)$ and $\bar{Q}_T = O\left(\theta_D(\epsilon_T) T^{\frac{m}{m+1}} d^{\frac{1}{m+1}} \log^2 T\right)$, where $\epsilon_T = (d/T)^{\frac{1}{m+1}}$.


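Proof. Fix $\epsilon > 0$, enumerate $D_\epsilon = \{P_1, P_2, \ldots, P_{|D_\epsilon|}\}$, and for each $t \in N$, define
$k(t) = \mathrm{argmin}_{1 \le k \le |D_\epsilon|} \|\mathcal{D}_t - P_k\|$. Let $\{X'_t\}_{t=1}^{\infty}$ be a sequence of independent samples, with $X'_t \sim P_{k(t)}$,
and let $Z'_t = \{(X'_1, h^*(X'_1)), \ldots, (X'_t, h^*(X'_t))\}$. Then

$$E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right] \le E\left[\sum_{t=1}^{T} diam_t(C[Z'_{t-1}])\right] + \sum_{t=1}^{T} \|\mathcal{D}_t - P_{k(t)}\|$$
$$\le E\left[\sum_{t=1}^{T} diam_t(C[Z'_{t-1}])\right] + \epsilon T \le \sum_{t=1}^{T} E\left[diam_{P_{k(t)}}(C[Z'_{t-1}])\right] + 2\epsilon T.$$

The classic convergence rates results from PAC learning [Anthony and Bartlett, 1999, Vapnik,
1982] imply

$$\sum_{t=1}^{T} E\left[diam_{P_{k(t)}}(C[Z'_{t-1}])\right] = \sum_{t=1}^{T} O\left(\frac{d \log t}{|\{i \le t : k(i) = k(t)\}|}\right)$$
$$\le O(d \log T) \cdot \sum_{t=1}^{T} \frac{1}{|\{i \le t : k(i) = k(t)\}|} \le O(d \log T) \cdot |D_\epsilon| \cdot \sum_{u=1}^{\lceil T/|D_\epsilon| \rceil} \frac{1}{u} \le O\left(d |D_\epsilon| \log^2(T)\right).$$

Thus, $\sum_{t=1}^{T} E[diam_t(C[Z_{t-1}])] \le O\left(d |D_\epsilon| \log^2(T) + \epsilon T\right) \le O\left(d \cdot \epsilon^{-m} \log^2(T) + \epsilon T\right)$.

Taking $\epsilon = (T/d)^{-\frac{1}{m+1}}$, this is $O\left(d^{\frac{1}{m+1}} \cdot T^{\frac{m}{m+1}} \log^2(T)\right)$. We therefore have

$$\bar{M}_T \le E\left[\sum_{t=1}^{T} \sup_{h \in C[Z_{t-1}]} er_t(h)\right] \le E\left[\sum_{t=1}^{T} diam_t(C[Z_{t-1}])\right] \le O\left(d^{\frac{1}{m+1}} \cdot T^{\frac{m}{m+1}} \log^2(T)\right).$$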




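Similarly, letting $\epsilon_T = (d/T)^{\frac{1}{m+1}}$, $\bar{Q}_T$ is at most

$$E\left[\sum_{t=1}^{T} \mathcal{D}_t(DIS(C[Z_{t-1}]))\right] \le E\left[\sum_{t=1}^{T} \mathcal{D}_t\left(DIS\left(B_{\mathcal{D}_t}(h^*, \max\{diam_t(C[Z_{t-1}]), \epsilon_T\})\right)\right)\right]$$
$$\le E\left[\sum_{t=1}^{T} \theta_D(\epsilon_T) \cdot \max\{diam_t(C[Z_{t-1}]), \epsilon_T\}\right] \le E\left[\sum_{t=1}^{T} \theta_D(\epsilon_T) \cdot diam_t(C[Z_{t-1}])\right] + \theta_D(\epsilon_T) T \epsilon_T$$
$$\le O\left(\theta_D(\epsilon_T) \cdot d^{\frac{1}{m+1}} \cdot T^{\frac{m}{m+1}} \log^2(T)\right). \quad □$$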
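
As a worked check on the choice of $\epsilon$ in the proof of Theorem 10.6, the two terms of the bound $d \cdot \epsilon^{-m} \log^2(T) + \epsilon T$ can be balanced directly. This is only a sketch: it ignores constants and the $\log^2(T)$ factor, which do not affect the polynomial rate.

```latex
\begin{align*}
d\,\epsilon^{-m} = \epsilon T
  &\iff \epsilon^{m+1} = d/T
  \iff \epsilon = (d/T)^{\frac{1}{m+1}} = (T/d)^{-\frac{1}{m+1}}, \\
\epsilon T &= d^{\frac{1}{m+1}}\, T^{1-\frac{1}{m+1}} = d^{\frac{1}{m+1}}\, T^{\frac{m}{m+1}}, \\
d\,\epsilon^{-m} &= d\,(T/d)^{\frac{m}{m+1}} = d^{\frac{1}{m+1}}\, T^{\frac{m}{m+1}}.
\end{align*}
```

Both terms therefore scale as $d^{\frac{1}{m+1}} T^{\frac{m}{m+1}}$ at this choice of $\epsilon$; a smaller $\epsilon$ inflates the cover-size term, while a larger $\epsilon$ inflates the approximation term $\epsilon T$.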


We can additionally construct a lower bound for this scenario, as follows. Suppose C contains
a full infinite binary tree for which all classifiers in the tree agree on some point. That is, there is
a set of points $\{x_b : b \in \{0, 1\}^k, k \in N\}$ such that, for $b_1 = 0$ and $\forall b_2, b_3, \ldots \in \{0, 1\}$, $\exists h \in C$
such that $h(x_{(b_1, \ldots, b_{j-1})}) = b_j$ for $j \ge 2$. For instance, this is the case for linear separators (and
most other natural "geometric" concept spaces).

Theorem 10.7. For any C as above, for any active learning algorithm, $\exists$ a set D satisfying
Assumption 10.2, a target function $h^* \in C$, and a sequence of distributions $\{\mathcal{D}_t\}_{t=1}^{T}$ in D such
that the achieved $\bar{M}_T$ and $\bar{Q}_T$ satisfy $\bar{M}_T = \Omega\left(T^{\frac{m}{m+1}}\right)$, and $\bar{M}_T = O\left(T^{\frac{m}{m+1}}\right) \implies \bar{Q}_T =$

The proof is analogous to that of Theorem 10.17 below, and is therefore omitted for brevity.


10.5  Learning  with  Noise 

In this section, we extend the above analysis to allow for various types of noise conditions commonly
studied in the literature. For this, we will need to study a noise-robust variant of CAL,
below referred to as Agnostic CAL (or ACAL). We prove upper bounds achieved by ACAL, as
well as (non-matching) minimax lower bounds.

10.5.1  Noise Conditions


The following assumption may be referred to as a strictly benign noise condition, which essentially
says the model is specified correctly in that $h^* \in C$, and though the labels may be
stochastic, they are not completely random, but rather each is slightly biased toward the $h^*$ label.

Assumption 10.8. $h^* = \mathrm{sign}(\eta - 1/2) \in C$ and $\forall x$, $\eta(x) \ne 1/2$.

A particularly interesting special case of Assumption 10.8 is given by Tsybakov's noise conditions,
which essentially control how common it is to have $\eta$ values close to 1/2. Formally:

Assumption 10.9. $\eta$ satisfies Assumption 10.8 and for some $c > 0$ and $\alpha > 0$,

$$\forall t > 0, \quad P(|\eta(x) - 1/2| < t) < c \cdot t^\alpha.$$
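
The inequality in Assumption 10.9 can be checked numerically on a toy example. The sketch below is illustrative only: it assumes $X$ is uniform over a finite grid in [0, 1] that avoids the point 1/2, uses $\eta(x) = 1/2 + \mathrm{sign}(x - 1/2)\,|x - 1/2|^\kappa$ (so that roughly $P(|\eta(X) - 1/2| < t) \approx 2t^{1/\kappa}$), and only tests finitely many values of $t$; the names `tsybakov_holds`, `kappa`, `xs`, and `ts` are hypothetical.

```python
def tsybakov_holds(eta, xs, c, alpha, ts):
    """Check P(|eta(X) - 1/2| < t) < c * t**alpha for X uniform on xs, for each t in ts."""
    for t in ts:
        mass = sum(1 for x in xs if abs(eta(x) - 0.5) < t) / len(xs)
        if not mass < c * t ** alpha:
            return False
    return True


N, kappa = 1000, 2.0
xs = [(i + 0.5) / N for i in range(N)]      # grid midpoints, never exactly 1/2


def eta(x):
    # Conditional probability of label +1; it approaches 1/2 polynomially fast near x = 1/2.
    return 0.5 + (1 if x > 0.5 else -1) * abs(x - 0.5) ** kappa


ts = [0.01, 0.05, 0.1, 0.2, 0.4]
ok_loose = tsybakov_holds(eta, xs, c=4.0, alpha=1 / kappa, ts=ts)
ok_tight = tsybakov_holds(eta, xs, c=1.0, alpha=1 / kappa, ts=ts)
```

With $\kappa = 2$ the condition holds at the tested values with $\alpha = 1/2$ and $c = 4$, but fails with $c = 1$, since about 20% of the grid has $|\eta - 1/2| < 0.01$ while $1 \cdot 0.01^{1/2} = 0.1$. Larger $\kappa$ means $\eta$ stays near 1/2 over a wider region, which corresponds to a smaller $\alpha$ and hence a harder noise condition.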

In the setting of shifting distributions, we will be interested in conditions for which the above
assumptions are satisfied simultaneously for all distributions in D. We formalize this in the
following.

Assumption 10.10. Assumption 10.9 is satisfied for all $\mathcal{D} \in D$, with the same $c$ and $\alpha$ values.


10.5.2  Agnostic  CAL 

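The following algorithm is essentially taken from [Dasgupta, Hsu, and Monteleoni, 2007a, Hanneke,
2011], adapted here for this stream-based setting. It is based on a subroutine:
$\mathrm{LEARN}(\mathcal{L}, \mathcal{Q}) = \mathrm{argmin}_{h \in C : er(h; \mathcal{L}) = 0} er(h; \mathcal{Q})$ if $\min_{h \in C} er(h; \mathcal{L}) = 0$, and otherwise $\mathrm{LEARN}(\mathcal{L}, \mathcal{Q}) = \emptyset$.

ACAL

1. $t \leftarrow 0$, $\mathcal{L}_t \leftarrow \emptyset$, $\mathcal{Q}_t \leftarrow \emptyset$, let $\hat{h}_t$ be any element of C
2. Do
3.     $t \leftarrow t + 1$
4.     Predict $\hat{Y}_t = \hat{h}_{t-1}(X_t)$
5.     For each $y \in \{-1, +1\}$, let $h^{(y)} = \mathrm{LEARN}(\mathcal{L}_{t-1}, \mathcal{Q}_{t-1})$
6.     If either $y$ has $h^{(-y)} = \emptyset$ or
           $er(h^{(-y)}; \mathcal{L}_{t-1} \cup \mathcal{Q}_{t-1}) - er(h^{(y)}; \mathcal{L}_{t-1} \cup \mathcal{Q}_{t-1}) > \hat{E}_{t-1}(\mathcal{L}_{t-1}, \mathcal{Q}_{t-1})$
7.         $\mathcal{L}_t \leftarrow \mathcal{L}_{t-1} \cup \{(X_t, y)\}$, $\mathcal{Q}_t \leftarrow \mathcal{Q}_{t-1}$
8.     Else Request $Y_t$, and let $\mathcal{L}_t \leftarrow \mathcal{L}_{t-1}$, $\mathcal{Q}_t \leftarrow \mathcal{Q}_{t-1} \cup \{(X_t, Y_t)\}$
9.     Let $\hat{h}_t = \mathrm{LEARN}(\mathcal{L}_t, \mathcal{Q}_t)$
10.    If $t$ is a power of 2
11.        $\mathcal{L}_t \leftarrow \emptyset$, $\mathcal{Q}_t \leftarrow \emptyset$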


The algorithm is expressed in terms of a function $\hat{E}_t(\mathcal{L}, \mathcal{Q})$, defined as follows. Let $\delta_i$
be a nonincreasing sequence of values in $(0, 1)$. Let $\xi_1, \xi_2, \ldots$ denote a sequence of independent
Uniform($\{-1, +1\}$) random variables, also independent from the data. For $V \subseteq$

C,  let  -Rt(V  )  =  Slip/ll)/l2gy  t-2L1°S2(‘-1)J  Sm=2Ll°S2(t-1)J+l  '  ( hi(Xm )  —  h-2(Xm)),  Dt(V)  = 

suP/ji,ft2eV  t_2Llog2(‘-1)J  Srr).=2L1°82(t-1)J+l  |/ll(^m)-/l2(^m)|,t)t(V;<J)  =  12Rt(V)+34yJ Dtivy^f^l 
752in(32 1  /&) .  ^lso,  for  any  finite  sets  £,  Q  C  X  x  y,  let  C[£]  =  {h  e  C  :  er(h;C)  = 

0},  C(e;£,  Q)  =  {7i  e  C[£]  :  er(h:  C  U  Q)  —  miri,yeC[£]  er(g:  C  U  Q)  <  e}.  Then  define 
Ut(e,  5 ;  £,  Q)  =  f/t(Ct(e;  £,  Q),  5),  and  (letting  Ze  =  { j  G  Z  :  2j  >  e}) 


£*(£,  Q)  =  inf  {  e  >  0  :  Vj  G  Ze,  min£7t(e,  tf|_iog(t)j;  £,  Q)  < 

m£N 


/— 4 
10.5.3  Learning with a Fixed Distribution

The  following  results  essentially  follow  from  [Hanneke,  2011],  adapted  to  this  stream-based 
setting. 

Theorem 10.11. For any strictly benign $(P, \eta)$, if $2^{-2^i} \ll \delta_i \ll 2^{-i}/i$, ACAL achieves an expected
excess number of mistakes $\bar{M}_T - M^*_T = o(T)$, and if $\theta_P(\epsilon) = o(1/\epsilon)$, then ACAL makes
an expected number of queries $\bar{Q}_T = o(T)$.

Theorem  10.12.  For  any  (P,  rf)  satisfying  Assumption  10.9,  if  D  =  {P},  ACAL  achieves  an  ex¬ 
pected  excess  number  of  mistakes  Mp  —  Mf  =  O  (d^+ 2  •  T^+ 2  log  )  +  S!=o S,21^. 

and  an  expected  number  of  queries  Qt  =  O  (0p(ep)  •  d“+2  •  lQg  ^_L_j  +  <^2® 

_  Q 

where  ep  =  T  a+2 . 

Corollary  10.13.  For  any  (P,  rf)  satisfying  Assumption  10.9,  if  D  =  {P}  and  Si  —  2  1  in  ACAL, 
the  algorithm  achieves  an  expected  number  of  mistakes  Mp  and  expected  number  of  queries  Qt 
such  that,  for 6p  =  T~ “+A  Mp  —  Mlf  =  O  (d «+2  •  T^+^j,  andQp  =  O  (0p(ep)  •  d M2  ■  T «+2  j . 

10.5.4  Learning  with  a  Drifting  Distribution 

We  can  now  state  our  results  concerning  ACAL,  which  are  analogous  to  Theorems  10.5  and  10.6 
proved  earlier  for  CAL  in  the  realizable  case. 

Theorem 10.14. If D is totally bounded (Assumption 10.1) and $\eta$ satisfies Assumption 10.8, then
ACAL with $\delta_i = 2^{-i}$ achieves an excess expected mistake bound $\bar{M}_T - M^*_T = o(T)$, and if
additionally $\theta_D(\epsilon) = o(1/\epsilon)$, then ACAL makes an expected number of queries $\bar{Q}_T = o(T)$.

The  proof  of  Theorem  10.14  essentially  follows  from  a  combination  of  the  reasoning  for 
Theorem  10.5  and  Theorem  10.15  below.  Its  proof  is  omitted. 

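Theorem 10.15. If Assumptions 10.2 and 10.10 are satisfied, then ACAL achieves an expected
excess number of mistakes $\bar{M}_T - M^*_T = \tilde{O}\left(T^{\frac{(\alpha+2)m+1}{(\alpha+2)(m+1)}} \log\left(1/\delta_{\lfloor \log(T) \rfloor}\right) + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i\right)$, and
an expected number of queries $\bar{Q}_T = \tilde{O}\left(\theta_D(\epsilon_T) T^{\frac{(\alpha+2)(m+1)-\alpha}{(\alpha+2)(m+1)}} \log\left(1/\delta_{\lfloor \log(T) \rfloor}\right) + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i\right)$,
where $\epsilon_T = T^{-\frac{\alpha}{(\alpha+2)(m+1)}}$.

The proof of this result is in many ways similar to that given above for the realizable case,
and is included among the supplemental materials.

We immediately have the following corollary for a specific $\delta_i$ sequence.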

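Corollary 10.16. With $\delta_i = 2^{-i}$ in ACAL, the algorithm achieves expected number of mistakes
$\bar{M}_T$ and expected number of queries $\bar{Q}_T$ such that, for $\epsilon_T = T^{-\frac{\alpha}{(\alpha+2)(m+1)}}$,
$\bar{M}_T - M^*_T = \tilde{O}\left(T^{\frac{(\alpha+2)m+1}{(\alpha+2)(m+1)}}\right)$ and $\bar{Q}_T = \tilde{O}\left(\theta_D(\epsilon_T) \cdot T^{\frac{(\alpha+2)(m+1)-\alpha}{(\alpha+2)(m+1)}}\right)$.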
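
To get a feel for these rates, the exponents of $T$ can be tabulated for particular noise and covering parameters. The sketch below reads the exponents of Corollary 10.16 as $\frac{(\alpha+2)m+1}{(\alpha+2)(m+1)}$ for the excess mistakes, $\frac{(\alpha+2)(m+1)-\alpha}{(\alpha+2)(m+1)}$ for the queries (which are additionally multiplied by the disagreement coefficient $\theta_D(\epsilon_T)$), and $-\frac{\alpha}{(\alpha+2)(m+1)}$ for $\epsilon_T$; the function name `drift_rate_exponents` is hypothetical.

```python
from fractions import Fraction


def drift_rate_exponents(alpha, m):
    """Exponents of T in the excess-mistake bound, the query bound, and epsilon_T."""
    alpha, m = Fraction(alpha), Fraction(m)
    denom = (alpha + 2) * (m + 1)
    mistakes = ((alpha + 2) * m + 1) / denom
    queries = (denom - alpha) / denom      # the bound also carries a factor theta_D(eps_T)
    eps = -alpha / denom
    return mistakes, queries, eps


fixed_distribution = drift_rate_exponents(1, 0)   # m = 0: a single distribution
drifting = drift_rate_exponents(2, 1)
```

With $m = 0$ the exponents reduce to $\frac{1}{\alpha+2}$ and $\frac{2}{\alpha+2}$, and both exponents stay strictly below 1 whenever $\alpha > 0$, so the bounds remain sublinear in $T$ apart from the $\theta_D(\epsilon_T)$ factor in the query bound.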

Just as in the realizable case, we can also state a minimax lower bound for this noisy setting.

Theorem 10.17. For any C as in Theorem 10.7, for any active learning algorithm, $\exists$ a set D
satisfying Assumption 10.2, a conditional distribution $\eta$, such that Assumption 10.10 is satisfied,
and a sequence of distributions $\{\mathcal{D}_t\}_{t=1}^{T}$ in D such that the $\bar{M}_T$ and $\bar{Q}_T$ achieved by the learning
algorithm satisfy $\bar{M}_T - M^*_T = \Omega\left(T^{\frac{1+m\alpha}{\alpha+2+m\alpha}}\right)$ and $\bar{M}_T - M^*_T = O\left(T^{\frac{1+m\alpha}{\alpha+2+m\alpha}}\right) \implies \bar{Q}_T = \Omega\left(T^{\frac{2+m\alpha}{\alpha+2+m\alpha}}\right)$.

The proof is included in the supplemental material.


10.6  Querying  before  Predicting 

One  interesting  alternative  to  the  above  framework  is  to  allow  the  learner  to  make  a  label  request 
before  making  its  label  predictions.  From  a  practical  perspective,  this  may  be  more  desirable 
and  in  many  cases  quite  realistic.  From  a  theoretical  perspective,  analysis  of  this  alternative 
framework  essentially  separates  out  the  mistakes  due  to  over-confidence  from  the  mistakes  due 
to  recognized  uncertainty.  In  some  sense,  this  is  related  to  the  KWIK  model  of  learning  of  [Li, 
Littman,  and  Walsh,  2008]. 

Analyzing the above procedures in this alternative model yields several interesting details.
Specifically, consider the following natural modifications to the above procedures. We refer to
the algorithm LAC as the same sequence of steps as CAL, except with Step 4 removed, and
an additional step added after Step 8 as follows. In the case that we requested the label $Y_t$, we
predict $Y_t$, and otherwise we predict $h_t(X_t)$. Similarly, we define the algorithm ALAC as having
the same sequence of steps as ACAL, except with Step 4 removed, and an additional step added
after Step 11 as follows. In the case that we requested the label $Y_t$, we predict $Y_t$, and otherwise
we predict $h_t(X_t)$.

The analysis of the number of queries made by LAC in this setting remains essentially unchanged.
However, if we consider running LAC in the realizable case, then the total number of
mistakes  in  the  entire  sequence  will  be  zero.  As  above,  for  any  example  for  which  LAC  does 
not  request  the  label,  every  classifier  in  the  version  space  agrees  with  the  target  function’s  label, 
and  therefore  the  inferred  label  will  be  correct.  For  any  example  that  LAC  requests  the  label  of, 
in  the  setting  where  queries  are  made  before  predictions,  we  simply  use  the  label  itself  as  our 
prediction,  so  that  LAC  certainly  does  not  make  a  mistake  in  this  case. 
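
The zero-mistake argument can be illustrated with a small simulation. This is only a sketch under assumptions not taken from the text: the concept class is thresholds on $[0,1]$ (label $+1$ iff $x \ge \theta$), the data are uniform and realizable, and the function name and default target are hypothetical.

```python
import random


def run_lac_thresholds(T, target=0.37, seed=0):
    """Simulate a LAC-style learner for thresholds; returns (queries, mistakes)."""
    rng = random.Random(seed)
    # Version space: all thresholds theta with lo < theta <= hi.
    lo, hi = 0.0, 1.0
    queries = mistakes = 0
    for _ in range(T):
        x = rng.random()
        y = 1 if x >= target else -1
        if lo < x < hi:
            # Consistent thresholds disagree on x: request the label first,
            # then use the observed label itself as the prediction.
            queries += 1
            prediction = y
            if y == 1:
                hi = x
            else:
                lo = x
        else:
            # Every consistent threshold agrees on x, so the label is inferred.
            prediction = 1 if x >= hi else -1
        mistakes += prediction != y
    return queries, mistakes


if __name__ == "__main__":
    print(run_lac_thresholds(2000))
```

In this sketch the target always stays inside the version space, so inferred labels are correct and queried rounds cannot be wrong, which mirrors the reasoning above.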

On the other hand, the analysis of ALAC in this alternative setting when we have noisy
labels can be far more subtle. In particular, because the version space is only guaranteed to
contain the best classifier with high confidence, there is still a small probability of making a
prediction that disagrees with the best classifier $h^*$ on each round that we do not request a label.
So controlling the number of mistakes in this setting comes down to controlling the probability of
removing $h^*$ from the version space. However, this confidence parameter appears in the analysis
of the number of queries, so that we have a natural trade-off between the number of mistakes and
the number of label requests.

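Formally, for any given nonincreasing sequence $\delta_i$ in $(0,1)$, under Assumptions 10.2 and
10.10, ALAC achieves an expected excess number of mistakes $\bar{M}_T - M_T^* \le \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i$, and
an expected number of queries

$$\bar{Q}_T = \tilde{O}\left( \theta_{\mathbb{D}}(\epsilon_T) \cdot T^{\frac{(\alpha+2)(m+1)-\alpha}{(\alpha+2)(m+1)}} \log\left(\frac{1}{\delta_{\lfloor \log(T) \rfloor}}\right) + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i \right),$$

where $\epsilon_T = T^{-\frac{\alpha}{(\alpha+2)(m+1)}}$. In particular, given any nondecreasing sequence $M_T$, we can set this
$\delta_i$ sequence to maintain $\bar{M}_T - M_T^* \le M_T$ for all $T$.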
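
As a small numerical illustration of how the confidence sequence controls the mistake term, the sketch below evaluates $\sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i$ for a given sequence. It assumes the logarithm is base 2 (matching the blocks $\{2^i+1,\ldots,2^{i+1}\}$ used in the analysis); the function name is hypothetical.

```python
import math


def mistake_budget(T, delta):
    """Evaluate sum_{i=0}^{floor(log2 T)} delta(i) * 2**i for a callable delta."""
    top = math.floor(math.log2(T))
    return sum(delta(i) * 2 ** i for i in range(top + 1))


if __name__ == "__main__":
    T = 1024
    print(mistake_budget(T, lambda i: 2.0 ** -i))  # grows like log2(T) + 1
    print(mistake_budget(T, lambda i: 4.0 ** -i))  # stays below 2 for every T
```

A faster-decaying sequence keeps the excess-mistake term bounded, at the price of a larger $\log(1/\delta_{\lfloor \log(T) \rfloor})$ factor in the query bound.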
10.7  Discussion


What is not implied by the results above is any sort of trade-off between the number of mistakes
and the number of queries. Intuitively, such a trade-off should exist; however, as CAL
lacks any parameter to adjust the behavior with respect to this trade-off, it seems we need a
different approach to address that question. In the batch setting, the analogous question is the
trade-off between the number of label requests and the number of unlabeled examples needed.
In the realizable case, that trade-off is tightly characterized by Dasgupta's splitting index analysis
[Dasgupta, 2005]. It would be interesting to determine whether the splitting index tightly
characterizes the mistakes-vs-queries trade-off in this stream-based setting as well.

In  the  batch  setting,  in  which  unlabeled  examples  are  considered  free,  and  performance  is 
only  measured  as  a  function  of  the  number  of  label  requests,  [Balcan,  Hanneke,  and  Vaughan, 
2010]  have  found  that  there  is  an  important  distinction  between  the  verifiable  label  complexity 
and  the  unverifiable  label  complexity.  In  particular,  while  the  former  is  sometimes  no  better  than 
passive  learning,  the  latter  can  always  provide  improvements  for  VC  classes.  Is  there  such  a  thing 
as  unverifiable  performance  measures  in  the  stream-based  setting?  To  be  concrete,  we  have  the 
following open problem. Is there a method for every VC class that achieves $O(\log(T))$ mistakes
and $o(T)$ queries in the realizable case?


10.8  Proof  of  Theorem  10.4 

Proof of Theorem 10.4. First note that, by the assumption that $\forall t$, $\mathrm{er}_t(h^*) = 0$, with probability
1 we have that $\forall t$, $Q_t = Z_t$. Thus, since the stated bound on $\bar{M}_T$ for the one-inclusion graph
algorithm has been established when using the true sequence of labeled examples $Z_T$ [Haussler,
Littlestone, and Warmuth, 1994b], it must hold here as well.

The remainder of the proof focuses on the bound on $\bar{Q}_T$. This proof is essentially based on a
related proof of [Hanneke, 2011], but reformulated for this stream-based model.

Let $V_t$ denote the set of classifiers $h \in C$ with $\mathrm{er}(h; Q_t) = 0$ (with $V_0 = C$). Classic results
from statistical learning theory [Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989, Vapnik,
1982] imply that for $t > d$, with probability at least $1 - \delta$,

$$\mathrm{diam}_t(V_{t-1}) \le c \, \frac{d \log(2e(t-1)/d) + \log(4/\delta)}{t-1} \tag{10.2}$$

for some universal constant $c \in (1, \infty)$.

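In particular, for $d < t \le T$, since the probability CAL requests the label $Y_t$ is $P(X_t \in \mathrm{DIS}(V_{t-1}))$,
(10.2) implies that this probability satisfies

$$P(X_t \in \mathrm{DIS}(V_{t-1})) \le P\left( X_t \in \mathrm{DIS}\left( \mathrm{B}_{\mathcal{D}_t}\left( h^*, c \, \frac{d \log(2e(t-1)/d) + \log(4/\delta)}{t-1} \right) \right) \right) + \delta$$

$$\le \theta_{\mathbb{D}}(d \log(T)/T) \cdot c \, \frac{d \log(2e(t-1)/d) + \log(4/\delta)}{t-1} + \delta.$$

Taking $\delta = d/(t-1)$, this implies

$$P(X_t \in \mathrm{DIS}(V_{t-1})) \le \theta_{\mathbb{D}}(d \log(T)/T) \, \frac{2cd \log(8e(t-1)/d)}{t-1}.$$

Thus, for $T > d$,

$$\bar{Q}_T = \sum_{t=1}^{T} P(X_t \in \mathrm{DIS}(V_{t-1})) \le d + 1 + \sum_{t=d+1}^{T-1} \theta_{\mathbb{D}}(d \log(T)/T) \, \frac{2cd \log(8et/d)}{t}$$

$$\le d + 1 + \theta_{\mathbb{D}}(d \log(T)/T) \, 2cd \log(8eT/d) \int_{d}^{T} \frac{1}{t} \, dt$$

$$= d + 1 + \theta_{\mathbb{D}}(d \log(T)/T) \, 2cd \log(8eT/d) \log(T/d).$$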


□ 


10.9  Proof  of  Theorem  10.15 

The following lemma is similar to a result proven by [Hanneke, 2011], based on the work of
[Koltchinskii, 2006], except here we have adapted the result to the present setting with changing
distributions. The proof is essentially identical to the proof of the original result of [Hanneke,
2011], and is therefore omitted here.

Lemma 10.18. [Hanneke, 2011] Suppose $\eta$ satisfies Assumption 10.8. For every $i \in \mathbb{N}$, on an
event $E_i$ with $P(E_i) \ge 1 - \delta_i$, $\forall t \in \{2^i + 1, \ldots, 2^{i+1}\}$, letting $t(i) = t - 2^i$,

• $\mathrm{er}(h^*; \mathcal{L}_{t-1}) = 0$,

• $\forall h \in C$ s.t. $\mathrm{er}(h; \mathcal{L}_{t-1}) = 0$ and $\mathrm{er}(h; \mathcal{L}_{t-1} \cup Q_{t-1}) - \mathrm{er}(h^*; \mathcal{L}_{t-1} \cup Q_{t-1}) \le \hat{E}_{t-1}(\mathcal{L}_{t-1}, Q_{t-1})$,
we have $\mathrm{er}_{2^i+1:t-1}(h) - \mathrm{er}_{2^i+1:t-1}(h^*) \le 2 \hat{E}_{t-1}(\mathcal{L}_{t-1}, Q_{t-1})$,

• if Assumption 10.10 is satisfied, $\hat{E}_{t-1}(\mathcal{L}_{t-1}, Q_{t-1}) \le K \left( \frac{d \log(t(i)/\delta_i)}{t(i)} \right)^{\frac{\alpha+1}{\alpha+2}}$

for some $(c, \alpha)$-dependent constant $K \in (1, \infty)$.

We can now prove Theorem 10.15.


Proof of Theorem 10.15. Fix any $i \in \mathbb{N}$; we will focus on bounding the expected excess
number of mistakes and expected number of queries for the values $t \in \{2^i + 1, \ldots, 2^{i+1}\}$. The
result will then follow simply by summing over values of $i \le \log(T)$.

The predictions for $t \in \{2^i + 1, \ldots, 2^{i+1}\}$ are made by $h_{t-1}$. Lemma 10.18 implies that with
probability at least $1 - \delta_i$, every $t \in \{2^i + 1, \ldots, 2^{i+1}\}$ has, $\forall h \in C[\mathcal{L}_{t-1}]$ with
$\mathrm{er}(h; \mathcal{L}_{t-1} \cup Q_{t-1}) - \mathrm{er}(h^*; \mathcal{L}_{t-1} \cup Q_{t-1}) \le \hat{E}_{t-1}(\mathcal{L}_{t-1}, Q_{t-1})$ (and therefore in particular for $h_{t-1}$),

$$\sum_{s=2^i+1}^{t-1} \mathrm{er}_s(h) - \mathrm{er}_s(h^*) \le K_1 \cdot (t - 2^i) \cdot \left( \frac{d \log((t - 2^i)/\delta_i)}{t - 2^i} \right)^{\frac{\alpha+1}{\alpha+2}} \le K_1 \cdot t^{\frac{1}{\alpha+2}} \cdot (d \log(t/\delta_i))^{\frac{\alpha+1}{\alpha+2}}, \tag{10.3}$$

for some finite constant $K_1$.

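Fix some value $\epsilon > 0$, and enumerate the elements of $\mathbb{D}_\epsilon = \{P_1, P_2, \ldots, P_{|\mathbb{D}_\epsilon|}\}$. Then let
$\mathbb{D}_{\epsilon,k} = \{P \in \mathbb{D} : k = \operatorname{argmin}_{j \le |\mathbb{D}_\epsilon|} \|P_j - P\|\}$, breaking ties arbitrarily in the argmin. This
induces a (Voronoi) partition $\{\mathbb{D}_{\epsilon,k} : k \le |\mathbb{D}_\epsilon|\}$ of $\mathbb{D}$.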
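
The partition step can be sketched in code as follows. This is only an illustration under assumptions not made in the text: distributions are represented as probability vectors on a common finite support, the distance is the $\ell_1$ norm, and ties are broken toward the lowest index. All names are hypothetical.

```python
def l1_distance(p, q):
    """L1 distance between two probability vectors on the same finite support."""
    return sum(abs(a - b) for a, b in zip(p, q))


def voronoi_cell(p, cover):
    """Index k of the cover element nearest to p (ties go to the lowest index)."""
    return min(range(len(cover)), key=lambda j: l1_distance(cover[j], p))


def partition(distributions, cover):
    """Group the time indices s by the Voronoi cell containing distribution s."""
    cells = {k: [] for k in range(len(cover))}
    for s, p in enumerate(distributions):
        cells[voronoi_cell(p, cover)].append(s)
    return cells


if __name__ == "__main__":
    cover = [(1.0, 0.0), (0.0, 1.0)]
    stream = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
    print(partition(stream, cover))
```

The map from a time index to its cell plays the role of $k(s)$ in the remainder of the proof.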

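Rewriting (10.3) in terms of this partition, we have

$$\sum_{k=1}^{|\mathbb{D}_\epsilon|} \; \sum_{\substack{s \in \{2^i+1, \ldots, t-1\} : \\ \mathcal{D}_s \in \mathbb{D}_{\epsilon,k}}} \mathrm{er}_s(h) - \mathrm{er}_s(h^*) \le K_1 \cdot t^{\frac{1}{\alpha+2}} \cdot (d \log(t/\delta_i)).$$

This means that, for any $k \le |\mathbb{D}_\epsilon|$, we have

$$(\mathrm{er}_{P_k}(h) - \mathrm{er}_{P_k}(h^*)) \cdot |\{s \in \{2^i+1, \ldots, t-1\} : \mathcal{D}_s \in \mathbb{D}_{\epsilon,k}\}| + \sum_{s=2^i+1}^{t-1} (\mathrm{er}_s(h) - \mathrm{er}_s(h^*)) \cdot \mathbb{1}_{\mathbb{D} \setminus \mathbb{D}_{\epsilon,k}}(\mathcal{D}_s)$$

$$\le K_1 \cdot t^{\frac{1}{\alpha+2}} \cdot (d \log(t/\delta_i)) + 2\epsilon \, |\{s \in \{2^i+1, \ldots, t-1\} : \mathcal{D}_s \in \mathbb{D}_{\epsilon,k}\}|.$$

Abbreviating by $k(s)$ the value of $k \le |\mathbb{D}_\epsilon|$ with $\mathcal{D}_s \in \mathbb{D}_{\epsilon,k}$, we have that

$$\mathrm{er}_t(h) - \mathrm{er}_t(h^*) \le 2\epsilon + \mathrm{er}_{P_{k(t)}}(h) - \mathrm{er}_{P_{k(t)}}(h^*)$$

$$\le 2\epsilon + \frac{2\epsilon \, |\{s \in \{2^i+1, \ldots, t-1\} : k(s) = k(t)\}| + K_1 \cdot t^{\frac{1}{\alpha+2}} \cdot (d \log(t/\delta_i))}{\max\{1, |\{s \in \{2^i+1, \ldots, t-1\} : k(s) = k(t)\}|\}}$$

$$\le 4\epsilon + \frac{2K_1 \cdot t^{\frac{1}{\alpha+2}} \cdot (d \log(t/\delta_i))}{|\{s \in \{2^i+1, \ldots, t\} : k(s) = k(t)\}|}. \tag{10.4}$$

Applying (10.4) simultaneously for all $t \in \{2^i + 1, \ldots, 2^{i+1}\}$ for $h = h_{t-1}$, we have

$$\bar{M}_T - M_T^* \le 4\epsilon T + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i + 2K_1 \cdot T^{\frac{1}{\alpha+2}} \cdot (d \log(T/\delta_{\lfloor \log(T) \rfloor})) \sum_{i=0}^{\lfloor \log(T) \rfloor} \sum_{k=1}^{|\mathbb{D}_\epsilon|} \sum_{u=1}^{|\{t \in \{2^i+1, \ldots, 2^{i+1}\} : k(t) = k\}|} \frac{1}{u}$$

$$\le 4\epsilon T + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i + 2K_1 \cdot T^{\frac{1}{\alpha+2}} \cdot \log(T) \, (d \log(T/\delta_{\lfloor \log(T) \rfloor})) \log_2(2T) \, |\mathbb{D}_\epsilon|$$

$$= O\left( \epsilon T + \epsilon^{-m} T^{\frac{1}{\alpha+2}} d \log^3(T) \log(1/\delta_{\lfloor \log(T) \rfloor}) + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i \right).$$

Taking $\epsilon = T^{-\frac{\alpha+1}{(\alpha+2)(m+1)}}$, this shows that

$$\bar{M}_T - M_T^* = O\left( T^{\frac{(\alpha+2)m+1}{(\alpha+2)(m+1)}} d \log^3(T) \log(1/\delta_{\lfloor \log(T) \rfloor}) + \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i \right).$$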
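We can bound $\bar{Q}_T$ in a similar fashion as follows. Fix any $i \le \log(T)$. Lemma 10.18
implies that with probability at least $1 - \delta_i$, for every $t \in \{2^i + 1, \ldots, 2^{i+1}\}$, letting
$\mathcal{E}_t = 4\epsilon + \frac{2K_1 \cdot t^{\frac{1}{\alpha+2}} \cdot (d \log(t/\delta_i))}{|\{s \in \{2^i+1, \ldots, t\} : k(s) = k(t)\}|}$, we have

$$P(\text{request } Y_t \mid \mathcal{L}_{t-1}, Q_{t-1})$$

$$\le P\left( X_t \in \mathrm{DIS}\left( \{h \in C[\mathcal{L}_{t-1}] : \mathrm{er}(h; \mathcal{L}_{t-1} \cup Q_{t-1}) - \mathrm{er}(h^*; \mathcal{L}_{t-1} \cup Q_{t-1}) \le \hat{E}_{t-1}(\mathcal{L}_{t-1}, Q_{t-1})\} \right) \,\middle|\, \mathcal{L}_{t-1}, Q_{t-1} \right)$$

$$\le P\left( X_t \in \mathrm{DIS}\left( \{h \in C : \mathrm{er}_t(h) - \mathrm{er}_t(h^*) \le \mathcal{E}_t\} \right) \right)$$

$$\le P\left( X_t \in \mathrm{DIS}\left( \left\{h \in C : \mathcal{D}_t(x : h(x) \ne h^*(x)) \le K_2 \cdot \mathcal{E}_t^{\frac{\alpha}{\alpha+1}} \right\} \right) \right)$$

$$\le \theta_{\mathbb{D}}\left( \mathcal{E}_t^{\frac{\alpha}{\alpha+1}} \right) \cdot K_2 \cdot \mathcal{E}_t^{\frac{\alpha}{\alpha+1}},$$

where the third inequality above is due to Assumption 10.10.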

Applying  this  simultaneously  to  all  i  <  log (T)  and  t  G  {2*  +  1, . . . ,  2H~1},  we  have,  for 

_  a+1 

€t  —  6  +  T  a+2 , 


Ll°g(T)J 


Llog(T)J  |De|  |{te{2i+l....,2i+1}:fc(t)=fc}| 


Qt  <  ^  5*2*  +  0O  [ef+1  j  A4dlog(r/5pog(T)j)  E  E 

i= 0 

Llog(T)J 


i=0  fc=l 


E  < 

<  £  5&  +  (^)  •  K 5  •  rflog(l/<JLiog(T)j)  log2(T)  •  (  e^T  +  |De|T(«+2)Wi) 

i=0  \ 

/  Llog(T)j 

O  ^  5,2*  +  0D  (e£+1)  log(l/5Llog(T)J)  log2(T)  •  ( e^T  +  e~m^ 


I  m  1  1 

max  <  e.  T  a+2  — 
V  1  m 


T  \  “+1 


i=0 


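Taking $\epsilon = \epsilon_T^{\frac{\alpha+1}{\alpha}} = T^{-\frac{\alpha+1}{(\alpha+2)(m+1)}}$, we have

$$\bar{Q}_T = O\left( \sum_{i=0}^{\lfloor \log(T) \rfloor} \delta_i 2^i + \theta_{\mathbb{D}}(\epsilon_T) \log(1/\delta_{\lfloor \log(T) \rfloor}) \log^2(T) \cdot T^{\frac{(\alpha+2)(m+1)-\alpha}{(\alpha+2)(m+1)}} \right).$$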


□ 


10.10  Proof  of  Theorem  10.17 

Proof of Theorem 10.17. Fix any $T \in \mathbb{N}$, and any particular active learning algorithm $\mathcal{A}$. We
construct a set of distributions tailored for these, as follows. Let $\kappa = (\alpha + 1)/\alpha$. Let
$\epsilon = T^{-\frac{\kappa}{2\kappa+m-1}}$, $M = T^{\frac{m}{2\kappa+m-1}} = \epsilon^{-m/\kappa}$, and $K = T^{\frac{2\kappa-1}{2\kappa+m-1}} = T/M$.

Inductively define a sequence $\{b_k\}_{k=1}^{\infty}$ as follows. Let $b_1 = 0$, $b_2 = 1$. For any integer $k \ge 3$,
given that values of $b_1, b_2, \ldots, b_{k-1}$, $\eta_3, \ldots, \eta_{k-1}$, $D_3, \ldots, D_{k-1}$, and $X_1, X_2, \ldots, X_{(k-3)K}$ have
already been defined, it is known [Hanneke, 2011] that for any active learning algorithm (possibly
randomized) there exists a value $b_k$ such that, for the distribution $D_k$ with
$D_k(\{x_{b_1, b_2, \ldots, b_{k-1}}\}) = \epsilon^{1/\kappa} = 1 - D_k(\{x_{b_1}\})$, there is a label distribution
$\eta_k(x) = P(Y = 1 | X = x)$ having $\eta_k(x_{b_1}) = 1$ and inducing $h^*(x_{b_1, b_2, \ldots, b_{k-1}}) = b_k$, which also
satisfies Tsybakov noise with parameters $c$ and $\alpha$ under distribution $D_k$: namely,
$\eta_k(x_{b_1, b_2, \ldots, b_{k-1}}) = \frac{1}{2}\left( 1 + (2b_k - 1) \epsilon^{\frac{\kappa-1}{\kappa}} \right)$. Furthermore, [Hanneke, 2011] shows that this $b_k$
can be chosen so that, for some $N = \Omega(\epsilon^{2/\kappa - 2})$, after observing any number fewer than $N$ random
labeled observations $(X, Y)$ with $X = x_{b_1, b_2, \ldots, b_{k-1}}$, if $h_n$ is the algorithm's hypothesis, then
$E[\mathrm{er}(h_n) - \mathrm{er}(h^*)] > \epsilon$, where the error rate is evaluated under $\eta_k$ and $D_k$. In particular, this
means that if the unlabeled samples are distributed according to $D_k$, then with any fewer than $N$
label requests, the expected excess error rate will be greater than $\epsilon$. But this also means that with
any fewer than $\Omega(\epsilon^{-1/\kappa} N) = \Omega(\epsilon^{1/\kappa - 2}) = \Omega(K)$ unlabeled examples sampled according to $D_k$,
the expected excess error rate will be greater than $\epsilon$.

Thus, to define the value $b_k$ given the already-defined values $b_1, b_2, \ldots, b_{k-1}$, we consider
$X_{(k-3)K+1}, X_{(k-3)K+2}, \ldots, X_{(k-2)K}$ i.i.d. $D_k$, independent from the other $X_1, \ldots, X_{(k-3)K}$
variables, and consider the values of $b_k$ and $\eta_k$ mentioned above, but defined for the active learning
algorithm that feeds the stream $X_1, X_2, \ldots, X_{(k-3)K}$ into $\mathcal{A}$ before feeding in the samples from
$D_k$. Thus, in this perspective, these $X_1, X_2, \ldots, X_{(k-3)K}$ random variables, and their labels
(which $\mathcal{A}$ may request), are considered internal random variables in this active learning
algorithm we have defined. This completes the inductive definition.

Now for the original learning problem we are interested in, we take as our fixed label distribution
an $\eta$ with $\eta(x_{b_1}) = 1$ and $\forall k > 2$, $\eta(x_{b_1, b_2, \ldots, b_{k-1}}) = \eta_k(x_{b_1, b_2, \ldots, b_{k-1}})$, and defined arbitrarily
elsewhere. Thus, for any $D_k$, this satisfies Tsybakov noise with the given $c$ and $\alpha$ parameters.

We define the family $\mathbb{D}$ of distributions as $\{D_3, D_4, \ldots, D_{M+2}\}$ for $M = T^{\frac{m}{2\kappa+m-1}} = \epsilon^{-m/\kappa}$
as above. Since these $D_i$ are each separated by distance exactly $\epsilon^{1/\kappa}$, $\mathbb{D}$ satisfies the constraint on
its cover sizes.

The sequence of data points will be the $X_1, X_2, \ldots, X_T$ sequence defined above, and the
corresponding sequence of distributions has $\mathcal{D}_1 = \mathcal{D}_2 = \cdots = \mathcal{D}_K = D_3$, $\mathcal{D}_{K+1} = \mathcal{D}_{K+2} =
\cdots = \mathcal{D}_{2K} = D_4$, and so on, up to $\mathcal{D}_{(M-1)K+1} = \mathcal{D}_{(M-1)K+2} = \cdots = \mathcal{D}_T = D_{M+2}$.

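Now applying the stated result of [Hanneke, 2011] used in the definition of the sequence, for
any $1 \le t \le \min\{\epsilon^{-1/\kappa} N, K\}$, and any $k < M$, denoting by $h_{kK+t-1}$ the classifier produced by
$\mathcal{A}$ after processing $kK + t - 1$ examples from this stream,
$E\left[ \mathrm{er}_{\mathcal{D}_{kK+t}}(h_{kK+t-1}) - \mathrm{er}_{\mathcal{D}_{kK+t}}(h^*) \right] > \epsilon = T^{-\frac{\kappa}{2\kappa+m-1}}$.

Since $\min\{\epsilon^{-1/\kappa} N, K\} = \Omega(K)$, the expected excess number of mistakes is

$$\bar{M}_T - M_T^* = \sum_{k=0}^{M-1} \sum_{t=1}^{K} E\left[ \mathrm{er}_{\mathcal{D}_{kK+t}}(h_{kK+t-1}) - \mathrm{er}_{\mathcal{D}_{kK+t}}(h^*) \right]$$

$$\ge \sum_{k=0}^{M-1} \sum_{t=1}^{\min\{\epsilon^{-1/\kappa} N, K\}} E\left[ \mathrm{er}_{\mathcal{D}_{kK+t}}(h_{kK+t-1}) - \mathrm{er}_{\mathcal{D}_{kK+t}}(h^*) \right] \ge \sum_{k=0}^{M-1} \sum_{t=1}^{\min\{\epsilon^{-1/\kappa} N, K\}} \epsilon$$

$$= \Omega(M \cdot K \cdot \epsilon) = \Omega\left( M \cdot (T/M) \cdot T^{-\frac{\kappa}{2\kappa+m-1}} \right) = \Omega\left( T^{\frac{\kappa+m-1}{2\kappa+m-1}} \right).$$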

Similarly, applying the stated result of [Hanneke, 2011] regarding the number of samples
of labels for the point $x_{b_1, b_2, \ldots, b_{k-1}}$ to achieve excess error $\epsilon$ being larger than $N$, we see that in
order to achieve this $\bar{M}_T - M_T^* = O\left( T^{\frac{\kappa+m-1}{2\kappa+m-1}} \right)$, we need that at least some constant fraction
of these $M$ segments receive an expected number of queries $\Omega(N)$, so that we will need
$\bar{Q}_T = \Omega(M \cdot N) = \Omega\left( T^{\frac{2\kappa+m-2}{2\kappa+m-1}} \right)$. □

Chapter  11


Active Learning with a Drifting Target Concept


Abstract

This chapter describes results on learning in the presence of a drifting target concept.¹ Specifically,
we provide bounds on the expected number of mistakes on a sequence of i.i.d. points,
labeled according to a target concept that can change by a given amount on each round. Some
of the results also describe an active learning variant of this setting, and provide bounds on the
number of queries for the labels of points in the sequence sufficient to obtain the stated bounds
on the number of mistakes.

¹ This chapter is based on joint work with Steve Hanneke and Varun Kanade.


11.1  Introduction

At this time, the work on active learning has focused on learning settings in which the concept
to be learned is static over time. However, in many real-world applications, such as webpage
classification, spam filtering, and face recognition, the data distribution and the concept itself
change over time. Our existing work in the previous chapter addresses the problem of active
learning  with  a  drifting  distribution,  providing  theoretical  guarantees  on  the  number  of  mistakes 
and label requests made by a particular active learning algorithm in a stream-based learning setting.
However, that work left open the question of a drifting target concept. To bridge this gap,
we  propose  to  study  the  problem  of  active  learning  (and  passive  learning)  with  a  drifting  target 
concept.  Specifically,  consider  a  statistical  learning  setting,  in  which  data  arrive  i.i.d.  in  a  stream, 
and  for  each  data  point  the  learner  is  required  to  predict  a  label  for  the  data  point  at  that  time, 
and  then  optionally  request  the  true  (target)  label  of  that  point.  We  are  then  interested  in  making 
a  small  number  of  queries  and  mistakes  (including  mistakes  on  unqueried  labels)  as  a  function 
of  the  number  of  points  processed  so  far  at  any  given  time.  The  target  labels  are  generated  from 
a function known to reside in a given concept space, and at each time the target function is allowed
to change by a distance $\epsilon$ (that is, the probability the new target function disagrees with
the  old  target  function  on  a  random  sample  is  at  most  e).  The  recent  work  of  [Koby  Crammer 
and  Vaughan,  2010]  studies  this  problem  in  the  context  of  passive  learning  of  linear  separators. 
In  this  theoretical  study,  we  intend  to  broaden  the  scope  of  that  work,  to  other  concept  spaces 
and  distributions,  improve  the  guarantees  on  performance,  establish  lower  bounds  on  achievable 
performance,  and  extend  the  framework  to  study  the  number  of  labels  requested  by  an  active 
learning algorithm while maintaining the performance guarantees established for passive learning.
In particular, we will be interested in bounding the number of queries and mistakes made
by a particular algorithm, as a function of $\epsilon$, the VC dimension of the concept space, and the
number of time steps so far. We will also consider variants of this in which $\epsilon$ is also allowed to
change over time, and then the bounds on the number of mistakes and queries should depend on
the sequence of $\epsilon$ values.

11.2  Definitions and Notations

Formally, in this setting, there is a sequence of i.i.d. unlabeled data $X_1, X_2, \ldots$ each with
marginal distribution $\mathcal{P}$ over the instance space $\mathcal{X}$. There is also a sequence of target functions
$h_1^*, h_2^*, \ldots$ in $C$, with $\mathcal{P}(x : h_t^*(x) \ne h_{t+1}^*(x)) \le \epsilon_{t+1}$ for each $t \in \mathbb{N}$. Each $t$ has an associated
target label $Y_t = h_t^*(X_t)$. A prediction $\hat{Y}_t$ is counted as a "mistake" if $\hat{Y}_t \ne Y_t$. We suppose
each $h_t^*$ is chosen independently from $X_t, X_{t+1}, \ldots$ (i.e., $h_t^*$ is chosen prior to the "draw" of
$X_t, X_{t+1}, \ldots \sim \mathcal{P}$). For the purposes of the results below, we do not necessarily require $h_t^*$ to be
independent from $X_1, \ldots, X_{t-1}$. Additionally, for any $x \in (0, \infty)$, define $\mathrm{Log}(x) = \ln(x) \vee 1$.

11.3  General Analysis under Constant Drift Rate: Inefficient Passive Learning

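The following lemma is due to [Vapnik and Chervonenkis, 1971].

Lemma 11.1. There exists a universal constant $c \in [1, \infty)$ such that, for any class $C$ of VC
dimension $d$, $\forall m \in \mathbb{N}$, $\forall \delta \in (0, 1)$, with probability at least $1 - \delta$, every $h, g \in C$ have

$$\left| \mathcal{P}(x : h(x) \ne g(x)) - \frac{1}{m} \sum_{t=1}^{m} \mathbb{1}[h(X_t) \ne g(X_t)] \right|$$

$$\le c \sqrt{ \left( \frac{1}{m} \sum_{t=1}^{m} \mathbb{1}[h(X_t) \ne g(X_t)] \right) \frac{d \log(m/d) + \log(1/\delta)}{m} } + c \, \frac{d \log(m/d) + \log(1/\delta)}{m}.$$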

Consider the following algorithm.

0.  Predict arbitrary values $\hat{Y}_1, \ldots, \hat{Y}_m$ for $Y_1, \ldots, Y_m$, respectively.

1.  For $T = m + 1, m + 2, \ldots$

2.  Let $h_T = \mathrm{ERM}(C, \{(X_{T-m}, Y_{T-m}), \ldots, (X_{T-1}, Y_{T-1})\})$

3.  Predict $\hat{Y}_T = h_T(X_T)$ as the prediction for the value of $Y_T$
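
The steps above can be sketched for a concrete class. The following is a minimal illustration, not the general method: it assumes the concept class is thresholds on $[0,1]$ (label $+1$ iff $x \ge \theta$), a uniform marginal, and a target that moves by exactly `eps` per round; the ERM is a brute-force search, and all function names are hypothetical.

```python
import random


def erm_threshold(window):
    """Return a threshold with the fewest errors on (x, y) pairs, y in {-1, +1}."""
    # Every labeling a threshold can induce on the window is realized by one of
    # the window points themselves, or by +inf (which labels everything -1).
    candidates = [x for x, _ in window] + [float("inf")]

    def errors(theta):
        return sum((1 if x >= theta else -1) != y for x, y in window)

    return min(candidates, key=errors)


def sliding_window_erm(xs, ys, m):
    """Predictions of the windowed-ERM method; the first m predictions are arbitrary (+1)."""
    preds = [1] * min(m, len(xs))
    for T in range(m, len(xs)):
        theta = erm_threshold(list(zip(xs[T - m:T], ys[T - m:T])))
        preds.append(1 if xs[T] >= theta else -1)
    return preds


def simulate(T, eps, m, seed=0):
    """Count mistakes on a stream whose target threshold moves by eps per round."""
    rng = random.Random(seed)
    theta, step = 0.5, eps
    xs, ys = [], []
    for _ in range(T):
        x = rng.random()
        xs.append(x)
        ys.append(1 if x >= theta else -1)
        if not 0.25 <= theta + step <= 0.75:
            step = -step  # bounce so the target stays inside [0.25, 0.75]
        theta += step
    preds = sliding_window_erm(xs, ys, m)
    return sum(p != y for p, y in zip(preds, ys))


if __name__ == "__main__":
    eps = 1e-4
    m = int((1 / eps) ** 0.5)  # window size floor(sqrt(d / eps)) with d = 1
    print(simulate(2000, eps, m))
```

The window size trades off the two error sources in the analysis: a longer window reduces estimation error but includes labels generated by older, more distant targets.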

The bound in the following theorem is a generalization of one given by [Koby Crammer and
Vaughan, 2010] for finite concept classes (which they claimed could be extended to spaces of
infinite VC dimension, presumably yielding something resembling the result stated here).

Theorem 11.2. If every $\epsilon_t = \epsilon$, for some constant value $\epsilon \in (0, 1)$, then the above algorithm,
with $m = \lfloor \sqrt{d/\epsilon} \rfloor$, makes an expected number of mistakes among the first $T$ instances that is
$O(\sqrt{d\epsilon} \log(1/(d\epsilon)) T)$.

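Proof. The statement is trivial for any $\epsilon \ge 1/(ed)$, so suppose $\epsilon < 1/(ed)$. Let us bound
$\mathrm{er}_t(h_t) := \mathcal{P}(x : h_t(x) \ne h_t^*(x))$ for an arbitrary $t > m$. By a Chernoff bound, with probability
at least $1 - \delta$,

$$\frac{1}{m} \sum_{i=t-m}^{t-1} \mathbb{1}[h_{t-m}^*(X_i) \ne h_i^*(X_i)] \le \frac{\log_2(1/\delta) + 2em^2\epsilon}{m} \le (2 \log_2(1/\delta) + 2ed) \sqrt{\epsilon/d}.$$

In particular, this means

$$\frac{1}{m} \sum_{i=t-m}^{t-1} \mathbb{1}[h_t(X_i) \ne h_{t-m}^*(X_i)] \le 2 (2 \log_2(1/\delta) + 2ed) \sqrt{\epsilon/d}.$$

By Lemma 11.1, on an additional event of probability at least $1 - \delta$,

$$\mathcal{P}(x : h_t(x) \ne h_{t-m}^*(x)) \le 2 (2 \log_2(1/\delta) + 2ed) \sqrt{\epsilon/d}$$

$$+ c \sqrt{ 2 (2 \log_2(1/\delta) + 2ed) \sqrt{\epsilon/d} \, \left( d \log(1/\sqrt{d\epsilon}) + \log(1/\delta) \right) 2\sqrt{\epsilon/d} }$$

$$+ c \left( d \log(1/\sqrt{d\epsilon}) + \log(1/\delta) \right) 2\sqrt{\epsilon/d}.$$

Taking $\delta = \sqrt{d\epsilon}$, this is at most

$$2\sqrt{d\epsilon} \left( \left( \sqrt{1/d} \log_2(1/(d\epsilon)) + 2e \right) + 2c \sqrt{1/d} \log_2(1/(d\epsilon)) + 2c \sqrt{2e \log(1/(d\epsilon))} + c \log(1/(d\epsilon)) \right)$$

$$\le 14(c+1) \sqrt{d\epsilon} \log(1/(d\epsilon)).$$

Since this holds with probability $1 - 2\delta = 1 - 2\sqrt{d\epsilon}$, and $\mathrm{er}_t(h_t) \le 1$ always, we have

$$E[\mathrm{er}_t(h_t)] \le E\left[ \mathcal{P}(x : h_t(x) \ne h_{t-m}^*(x)) \right] + \mathcal{P}(x : h_{t-m}^*(x) \ne h_t^*(x))$$

$$\le 14(c+1) \sqrt{d\epsilon} \log(1/(d\epsilon)) + 2\sqrt{d\epsilon} + m\epsilon \le (14c + 17) \sqrt{d\epsilon} \log(1/(d\epsilon)).$$

Therefore,

$$E\left[ \sum_{t=1}^{T} \mathbb{1}[h_t(X_t) \ne h_t^*(X_t)] \right] \le m + (14c + 17) \sqrt{d\epsilon} \log(1/(d\epsilon)) (T - m) = O(\sqrt{d\epsilon} \log(1/(d\epsilon)) T).$$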

□ 

It may be possible to remove the $\log(1/(d\epsilon))$ factor in some cases (e.g., homogeneous halfspaces
under a uniform distribution on the sphere); it's not yet clear whether or not it should
sometimes belong there in the optimal number of mistakes.

11.4  General Analysis under Constant Drift Rate: Sometimes-Efficient Passive Learning

The following method is often (though certainly not always) computationally efficient. For instance,
it is efficient for linear separators.

0.  Let $h_0$ be an arbitrary classifier in $C$

1.  For $T = 1, 2, \ldots$

2.  If $T > m \lceil \log_2(1/\epsilon) \rceil$, let $m_T \in \{m, \ldots, m \lceil \log_2(1/\epsilon) \rceil\}$ be minimal s.t.
$\min_{h \in C} \sum_{t=T-m_T}^{T-m_T+m-1} \mathbb{1}[h(X_t) \ne Y_t] = 0$ (if it exists)

3.  If $m_T$ exists, let $h_T = \operatorname{argmin}_{h \in C} \sum_{t=T-m_T}^{T-m_T+m-1} \mathbb{1}[h(X_t) \ne Y_t]$

4.  Else let $h_T = h_{T-1}$

5.  Predict $\hat{Y}_T = h_T(X_T)$ as the prediction for the value of $Y_T$
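
Steps 2 through 4 can be sketched as follows for a simple class. This is an illustration under assumptions, not the general procedure: the concept class is thresholds on the real line (label $+1$ iff $x \ge \theta$), `history` holds the labeled examples from rounds $1, \ldots, T-1$, and the window for offset $m_T$ is read as the $m$ consecutive examples starting $m_T$ rounds back. The function names are hypothetical.

```python
import math


def consistent_threshold(window):
    """A threshold with zero errors on the window, or None if none exists."""
    neg = [x for x, y in window if y == -1]
    pos = [x for x, y in window if y == +1]
    if neg and pos and max(neg) >= min(pos):
        return None
    return min(pos) if pos else float("inf")


def update_hypothesis(history, m, eps, prev_theta):
    """Return the threshold for the current round, given all past (x, y) pairs."""
    L = math.ceil(math.log2(1 / eps))
    n = len(history)  # n = T - 1 labeled examples seen so far
    if n >= m * L:
        # Try offsets m, m + 1, ..., m * L in increasing order, so the first
        # consistent window found corresponds to the minimal m_T.
        for m_T in range(m, m * L + 1):
            start = n - m_T
            theta = consistent_threshold(history[start:start + m])
            if theta is not None:
                return theta
    return prev_theta  # Step 4: keep the previous hypothesis


if __name__ == "__main__":
    past = [(0.2, -1), (0.7, 1), (0.5, 1), (0.6, -1)]
    print(update_hypothesis(past, m=2, eps=0.25, prev_theta=0.3))
```

For thresholds the consistency check is linear in the window size, which is the sense in which the method can be computationally efficient for some classes; for other classes the check may be much harder.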

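Theorem 11.3. If every $\epsilon_t = \epsilon$, for some constant value $\epsilon \in (0, 1)$, then the above algorithm,
with $m = \left\lfloor \frac{1}{\sqrt{\epsilon} \lceil \log_2(1/\epsilon) \rceil} \right\rfloor$, makes an expected number of mistakes among the first $T$ instances
that is $O(d \sqrt{\epsilon} \log^2(1/\epsilon) T)$.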

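Proof. The statement is trivial for any $\epsilon \ge 1/(ed)^2$, so suppose $\epsilon < 1/(ed)^2$. Let us bound
$E[\mathrm{er}_t(h_t)] := E[\mathcal{P}(x : h_t(x) \ne h_t^*(x))]$ for an arbitrary $t > m \lceil \log_2(1/\epsilon) \rceil$.

Fix any $M \in \{m, \ldots, m \lceil \log_2(1/\epsilon) \rceil\}$. By a Chernoff bound, with probability at least
$1 - \epsilon/(m \lceil \log_2(1/\epsilon) \rceil)$,

$$\frac{1}{m} \sum_{k=t-M}^{t-M+m-1} \mathbb{1}[h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(X_k) \ne h_k^*(X_k)] \le \frac{1}{m} \log_2((m \lceil \log_2(1/\epsilon) \rceil)/\epsilon) + 2e\epsilon \lceil \log_2(1/\epsilon) \rceil m.$$

Combined with Lemma 11.1, this implies that with probability at least $1 - 2\epsilon/(m \lceil \log_2(1/\epsilon) \rceil)$,
for any $h \in C$ with

$$\sum_{k=t-M}^{t-M+m-1} \mathbb{1}[h(X_k) \ne h_k^*(X_k)] = 0,$$

it must have

$$\frac{1}{m} \sum_{k=t-M}^{t-M+m-1} \mathbb{1}[h(X_k) \ne h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(X_k)] \le \frac{1}{m} \log_2((m \lceil \log_2(1/\epsilon) \rceil)/\epsilon) + 2e\epsilon \lceil \log_2(1/\epsilon) \rceil m,$$

and therefore

$$\mathcal{P}(x : h(x) \ne h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(x))$$

$$\le \left( \frac{1}{m} \log_2((m \lceil \log_2(1/\epsilon) \rceil)/\epsilon) + 2e\epsilon \lceil \log_2(1/\epsilon) \rceil m \right)$$

$$+ c \sqrt{ \left( \frac{1}{m} \log_2((m \lceil \log_2(1/\epsilon) \rceil)/\epsilon) + 2e\epsilon \lceil \log_2(1/\epsilon) \rceil m \right) \frac{d \log(m/d) + \log((m \lceil \log_2(1/\epsilon) \rceil)/\epsilon)}{m} }$$

$$+ c \, \frac{d \log(m/d) + \log((m \lceil \log_2(1/\epsilon) \rceil)/\epsilon)}{m}$$

$$\le 19 \sqrt{\epsilon} \log_2^2(1/\epsilon) + 12c \sqrt{d\epsilon} \log_2^2(1/\epsilon) + 24cd \sqrt{\epsilon} \log_2^2(1/\epsilon)$$

$$\le 55cd \sqrt{\epsilon} \log_2^2(1/\epsilon).$$

If this is the case, then

$$\mathrm{er}_t(h) \le \mathcal{P}(x : h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(x) \ne h_t^*(x)) + \mathcal{P}(x : h(x) \ne h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(x))$$

$$\le \epsilon m \lceil \log_2(1/\epsilon) \rceil + 55cd \sqrt{\epsilon} \log_2^2(1/\epsilon)$$

$$\le 56cd \sqrt{\epsilon} \log_2^2(1/\epsilon).$$

Thus, by a union bound, with probability at least $1 - 2\epsilon$, if $m_t$ exists, then

$$\mathrm{er}_t(h_t) \le 56cd \sqrt{\epsilon} \log_2^2(1/\epsilon).$$

For any given $i \in \{1, \ldots, \lceil \log_2(1/\epsilon) \rceil\}$, by a union bound, the probability that

$$\sum_{k=t-mi}^{t-m(i-1)-1} \mathbb{1}[h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(X_k) \ne h_k^*(X_k)] > 0$$

is at most $\epsilon \lceil \log_2(1/\epsilon) \rceil m^2 \le 1/2$. Since these sums are independent over values of $i$, we have
that with probability at least $1 - \epsilon$, at least one of these values of $i \in \{1, \ldots, \lceil \log_2(1/\epsilon) \rceil\}$ will
have $\sum_{k=t-mi}^{t-m(i-1)-1} \mathbb{1}[h_{t-m\lceil \log_2(1/\epsilon) \rceil}^*(X_k) \ne h_k^*(X_k)] = 0$. In particular, on this event, this implies
$m_t$ exists in Step 2.

Altogether, since $\mathrm{er}_t(h_t) \le 1$ always, we have

$$E[\mathrm{er}_t(h_t)] \le 56cd \sqrt{\epsilon} \log_2^2(1/\epsilon) + 3\epsilon \le 59cd \sqrt{\epsilon} \log_2^2(1/\epsilon).$$

Therefore,

$$E\left[ \sum_{t=1}^{T} \mathbb{1}[h_t(X_t) \ne h_t^*(X_t)] \right] \le m \lceil \log_2(1/\epsilon) \rceil + 59cd \sqrt{\epsilon} \log_2^2(1/\epsilon) T = O\left( d \sqrt{\epsilon} \log^2(1/\epsilon) T \right).$$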


□ 


11.4.1  Lower  Bounds 

In this section, we establish a lower bound on the number of mistakes that can be achieved when
the target function may drift by $\epsilon$ at each step.

Thresholds 

For simplicity, we first consider the case where the distribution is uniform over $[-1, 1]$, and the
concept class is threshold functions. Between each time-step the threshold may move to the left
or right by $\epsilon$.

Theorem 11.4. For any $\epsilon \le 1/16$, any algorithm for learning under drifting targets makes at
least $\sqrt{\epsilon} \, T/(4e)$ mistakes in expectation.

Proof. Consider the following strategy that the adversary uses to define the drifting thresholds.
For simplicity assume that $2/\sqrt{\epsilon}$ is an even integer and $T$ is divisible by $2/\sqrt{\epsilon}$. The game is
divided into $k = T/(2/\sqrt{\epsilon})$ epochs, each consisting of $2/\sqrt{\epsilon}$ time steps. We have the following:


•  At the beginning of each epoch, the threshold is at 0. The adversary tosses an unbiased
coin.

•  If the outcome is heads, for the next $1/\sqrt{\epsilon}$ time-steps, the threshold increases by $\epsilon$ at each
time-step. Then for the next $1/\sqrt{\epsilon}$ time-steps it decreases by $\epsilon$ at each time-step. Thus, at the
beginning of the next epoch, the threshold is again at 0.

•  If the outcome is tails, the adversary first decreases the threshold by $\epsilon$ for the first $1/\sqrt{\epsilon}$
time-steps; then increases it again. Thus, in either case, at the end of the epoch the threshold
is again at 0.
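
The adversary's strategy can be written out as a short generator of threshold positions. This is only a sketch: it assumes $1/\sqrt{\epsilon}$ is an integer (as in the simplifying assumption above), takes the coin outcomes as an explicit list, and uses a hypothetical function name.

```python
def adversary_thresholds(eps, coin_flips):
    """Threshold position after each time-step, one epoch per coin flip.

    Each epoch has 2 / sqrt(eps) steps: the threshold moves by eps per step
    away from 0 for the first half, then back toward 0 for the second half.
    """
    half = round(1 / eps ** 0.5)
    path = []
    for heads in coin_flips:
        sign = 1 if heads else -1
        theta = 0.0
        for step in range(2 * half):
            direction = sign if step < half else -sign
            theta += direction * eps
            path.append(theta)
    return path


if __name__ == "__main__":
    path = adversary_thresholds(1 / 16, [True, False])
    print(path)
```

Within an epoch the threshold sits at $\pm \epsilon t$ after $t \le 1/\sqrt{\epsilon}$ steps, with the sign determined by the coin, which is the uncertainty the proof exploits.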

We first assume that the algorithm knows the strategy of the adversary (but not the coin
tosses). This can only make the algorithm more powerful. Since at the end of each epoch, the
algorithm knows exactly where the threshold is, the total (expected) number of mistakes is $k$
times the expected number of mistakes in each epoch. Without loss of generality consider the
first epoch, i.e., time-steps 1 to $2/\sqrt{\epsilon}$. For $t \le 1/\sqrt{\epsilon}$, let $Z_t$ denote the random variable that is 1 if
at time-step $t$, the random example $x_t$ is inside the interval $[-\epsilon t, \epsilon t]$. Note that $\Pr[Z_t = 1] = \epsilon t$.
Let $M_t$ denote the random variable that is 1 if the algorithm makes a mistake at time-step $t$
and 0 otherwise. (Here the expectation is over the randomness of the examples as well as the
adversary's coin toss). Then, consider the following:

$$E[M_t \mid Z_1 = 0, \ldots, Z_{t-1} = 0, Z_t = 1] = \frac{1}{2}.$$

This is because the only information the algorithm has at this time is that the threshold is either
at $-\epsilon t$ or $\epsilon t$, each with equal probability. Therefore,

$$E[M_t] \ge \frac{\epsilon t (1 - \sqrt{\epsilon})^{t-1}}{2}.$$

Let $S = 1/\sqrt{\epsilon}$. Then, the expected number of mistakes between the time-steps 1 to $S$ is
$E[\sum_{t=1}^{S} M_t] = \sum_{t=1}^{S} E[M_t]$. Then, we have

$$\sum_{t=1}^{S} E[M_t] \ge \frac{1}{2} \sum_{t=1}^{S} \epsilon t (1 - \sqrt{\epsilon})^{t-1}.$$


Using  the  fact  that  Ef=i  ^  1  >  (1  —  a:6) /{l  —  x)  for  small  enough  x,  we  get 


2^  E[Mt]  >  -  •  v  v  ' 


t=i 


2  (1- (1-^)2 


1 

>  — 
“  2e 


In the last line we used the fact that $(1 - x)^{1/x} \le 1/e$. Now, it must be the case that the total
(expected) number of mistakes is at least $k/(2e) = \sqrt{\epsilon} \, T/(4e)$. □


Halfspaces 

Now consider the case where $\mathcal{X} = \mathbb{R}^k$ for $k \in \mathbb{N}$, and where the concept space $C$ is the set of
halfspaces (linear separators): that is, for every $h \in C$, $\exists w \in \mathbb{R}^k$ and $b \in \mathbb{R}$ such that $\forall x \in \mathbb{R}^k$,
$h(x) = +1$ iff $w \cdot x + b \ge 0$. In this case, we have the following result.

Theorem 11.5. For any $k \in \mathbb{N}$, for $\mathcal{X} = \mathbb{R}^k$ and $C$ the class of halfspaces on $\mathbb{R}^k$, for any $\epsilon \le 1/k$,
for any algorithm for learning under $\epsilon$-drifting targets, there exists a distribution $\mathcal{P}$ over $\mathbb{R}^k$ and
a sequence of $\epsilon$-drifting (w.r.t. $\mathcal{P}$) targets $h_1^*, h_2^*, \ldots$ in $C$ such that, for any $T \in \mathbb{N}$, the expected
number of mistakes made by the algorithm among the first $T$ rounds is at least $\sqrt{\epsilon k} \, T/8$.

Proof. Consider the distribution $\mathcal{P}$ that is uniform over the set

\[ \bigcup_{i=1}^{k} \{0\}^{i-1} \times [0,1] \times \{0\}^{k-i}; \]

that is, $\mathcal{P}$ is uniform in $[0,1]$ along each of the axes. Now, by the probabilistic method, it suffices
to show that there exists a way to randomly set the sequence of target functions so that the expected
number of mistakes is at least the stated lower bound. We will choose the target functions
from among the subset of $\mathbb{C}$ consisting of halfspaces whose respective separating hyperplanes
intersect all $k$ axes in $[0,1]$: that is, $\forall i \leq k$, $\{x : w \cdot x + b = 0\} \cap \left(\{0\}^{i-1} \times [0,1] \times \{0\}^{k-i}\right) \neq \emptyset$.

Note that each halfspace of this type can be specified by $k$ values, $(z_1, \ldots, z_k)$, corresponding
to the $k$ intersection values with the axes: that is, $\forall i \leq k$, the point $x \in \{0\}^{i-1} \times [0,1] \times \{0\}^{k-i}$
on the separating hyperplane has $x_i = z_i \in [0,1]$.

Consider the following strategy that the adversary uses to define the drifting targets. For simplicity
assume that $2\sqrt{k/\epsilon}$ is an even integer and $T$ is divisible by $2\sqrt{k/\epsilon}$. The game is divided
into $\ell = T/(2\sqrt{k/\epsilon})$ epochs, each consisting of $2\sqrt{k/\epsilon}$ time steps. We have the following:

• At the beginning of each epoch, the target function has $z_i = 1/2$ for all $i \leq k$. The adversary
tosses $k$ unbiased coins $c_1, \ldots, c_k$.

• For each $i \leq k$, if the outcome of tossing $c_i$ is heads, for the next $\sqrt{k/\epsilon}$ time-steps, the
value of $z_i$ is increased by $\epsilon$ at each time-step, and then for the following $\sqrt{k/\epsilon}$ time-steps
it decreases by $\epsilon$. Thus, at the beginning of the next epoch, the target once again has
$z_i = 1/2$ for all $i \leq k$.

• For each $i$, if the outcome of $c_i$ is tails, the adversary first decreases the value of $z_i$ by $\epsilon$
for the next $\sqrt{k/\epsilon}$ time-steps, and then increases again by $\epsilon$ on each round. Thus, in either
case, at the end of the epoch the target again has $z_i = 1/2$ for all $i \leq k$.

We first assume that the algorithm knows the strategy of the adversary (but not the coin
tosses). This can only make the algorithm more powerful. Since at the end of each epoch, the
algorithm knows exactly where the threshold is, the total (expected) number of mistakes is $\ell$
times the expected number of mistakes in each epoch. Without loss of generality consider the
first epoch, i.e., time-steps $1$ to $2\sqrt{k/\epsilon}$. For $t \leq \sqrt{k/\epsilon}$ and $i \leq k$, let $Z_{it}$ denote the random
variable that is 1 if at time-step $t$, the $i$-th coordinate of the random variable $x_t$ is inside the interval
$[1/2 - \epsilon t, 1/2 + \epsilon t]$. Note that $\Pr[Z_{it} = 1] = 2\epsilon t/k$. Let $M_t$ denote the random variable that
is 1 if the algorithm makes a mistake at time-step $t$ and 0 otherwise. (Here the expectation is
over the randomness of the examples as well as the adversary's coin tosses.) Then, consider the
following:

\[ E[M_t \mid Z_{i1} = 0, \ldots, Z_{i(t-1)} = 0, Z_{it} = 1] = \frac{1}{2}. \]

For any $i \leq k$, if any $Z_{it} = 1$ for $t \leq \sqrt{k/\epsilon}$, then there must exist a first such $t$, in which case
the above equality holds at that time $t$. Therefore,


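\begin{align*}
E\left[\sum_{t=1}^{\sqrt{k/\epsilon}} M_t\right]
&\geq \sum_{i=1}^{k} \frac{1}{2}\,\mathbb{P}\left(\exists t \leq \sqrt{k/\epsilon} : Z_{it} = 1\right)
= \frac{k}{2}\left(1 - \prod_{t=1}^{\sqrt{k/\epsilon}} \left(1 - 2\epsilon t/k\right)\right) \\
&\geq \frac{k}{2}\left(1 - \exp\left\{-2(\epsilon/k)\sum_{t=1}^{\sqrt{k/\epsilon}} t\right\}\right)
\geq \frac{k}{2}\left(1 - e^{-1}\right) \geq k/4.
\end{align*}

Now, it must be the case that the total (expected) number of mistakes is at least $\ell k/4 = T\sqrt{\epsilon k}/8$.

□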
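The adversary's strategy in the proof above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `epoch_schedule` and `per_epoch_lower_bound` are hypothetical, the coin outcomes are passed in explicitly rather than sampled, and $\sqrt{k/\epsilon}$ is assumed to be an integer as in the proof. The second function evaluates the quantity $\frac{k}{2}\left(1 - \prod_{t}(1 - 2\epsilon t/k)\right)$ from the displayed chain, so it can be compared numerically with $k/4$.

```python
import math

def epoch_schedule(k, eps, coins):
    """Return the targets (z_1, ..., z_k) visited during one epoch.

    coins[i] is True for heads (z_i rises first) and False for tails.
    Assumes sqrt(k / eps) is an integer, as in the proof.
    """
    half = round(math.sqrt(k / eps))
    z = [0.5] * k
    path = [tuple(z)]
    for step in range(2 * half):
        first_half = step < half
        for i in range(k):
            # Heads: up then down. Tails: down then up.
            going_up = coins[i] if first_half else not coins[i]
            z[i] += eps if going_up else -eps
        path.append(tuple(z))
    return path

def per_epoch_lower_bound(k, eps):
    """Evaluate (k/2) * (1 - prod_{t=1}^{sqrt(k/eps)} (1 - 2*eps*t/k))."""
    half = round(math.sqrt(k / eps))
    prod = 1.0
    for t in range(1, half + 1):
        prod *= 1.0 - 2.0 * eps * t / k
    return 0.5 * k * (1.0 - prod)
```

For instance, with $k = 4$ and $\epsilon = 0.01$ each epoch has 40 steps, every coordinate returns to $1/2$ at the end of the epoch, and the evaluated per-epoch quantity exceeds $k/4 = 1$.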


11.4.2  Random  Drifts 

In this section, we consider a very simple case of “random drift”. We consider the class of
homogeneous linear separators in $\mathbb{R}^2$, say $\mathbb{C}_2$, and let $\mu$ be any radially symmetric measure on $\mathbb{R}^2$.

We show a simple lower bound: in this setting, the achievable expected number of mistakes is $\Omega(\epsilon^{2/3}T)$.

Proposition 11.6. Let $\mathbb{C}_2$ be the class of homogeneous linear separators in $\mathbb{R}^2$ and let $\mu$ be any
radially symmetric measure on $\mathbb{R}^2$. Suppose $c_1, c_2, \ldots, c_T$ is a (random) sequence of concepts
from $\mathbb{C}_2$, where $c_{i+1}$ is chosen uniformly at random from the two concepts $c \in \mathbb{C}_2$ such that
$\mathrm{err}_\mu(c_i, c) = \epsilon$. Then, for any algorithm the expected number of mistakes is $\Omega(\epsilon^{2/3}T)$. (Here
the expectation is taken over the randomness of the sequence $c_i$ and the examples drawn from $\mu$.)

Proof. This follows from the anti-concentration of the standard random walk. □
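To make the random-walk intuition concrete: under a radially symmetric measure, two homogeneous separators whose normals differ by angle $\theta \leq \pi$ disagree with probability $\theta/\pi$, so each step of the drift rotates the target by $\pm\pi\epsilon$ and the net rotation after $m$ steps is $\pi\epsilon|S_m|$ for a simple $\pm 1$ random walk $S_m$. The following Python sketch (the function name is hypothetical, and this is an illustration rather than part of the proof) computes $E|S_m|$ exactly and can be compared with the familiar $\sqrt{2m/\pi}$ scaling.

```python
import math

def expected_abs_position(m):
    """Exact E|S_m| for a simple +/-1 random walk after m steps."""
    total = 0.0
    for heads in range(m + 1):
        # Probability of exactly `heads` steps of +1 among m fair steps.
        prob = math.comb(m, heads) / 2.0 ** m
        total += prob * abs(2 * heads - m)
    return total
```

The typical displacement grows like $\sqrt{m}$, so the target typically disagrees with its position $m$ steps earlier on a region of mass of order $\epsilon\sqrt{m}$ (while this stays below $1/2$).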

Proposition 11.7. Under the conditions of the above proposition, the algorithm above achieves a
mistake bound of $O(\epsilon^{2/3}T)$.

Proof. The main idea is that because of random drift, the expected number of examples that are
consistent with a fixed classifier is actually $1/\epsilon^{2/3}$, instead of $1/\sqrt{\epsilon}$. □

11.5  Linear  Separators  under  the  Uniform  Distribution 

For the special case of learning linear separators in $\mathbb{R}^k$, the results of Section 11.4 imply that
it is possible to achieve an expected number of mistakes and queries $\tilde{O}(d\sqrt{\epsilon}\,T)$ among the first
$T$ instances, using an algorithm that runs in time $\mathrm{poly}(d, 1/\epsilon)$ (and independent of $T$) for each
prediction. In the special case of learning homogeneous linear separators under the uniform
distribution on a unit sphere, it is possible to improve this result; specifically, we show there exists
an efficient algorithm that achieves a bound on the expected number of mistakes and queries that
is $\tilde{O}(\sqrt{d\epsilon}\,T)$, as was possible with the inefficient algorithm of Section 11.3. The technique is
based on a modification of the algorithm presented in Section 11.3, replacing ERM with (a
modification of) the computationally-efficient algorithm of [Awasthi, Balcan, and Long, 2013].
Formally, define the class of homogeneous linear separators as the set of classifiers $h_w : \mathbb{R}^d
\to \{-1, +1\}$, for $w \in \mathbb{R}^d$ with $\|w\| = 1$, such that $h_w(x) = \mathrm{sign}(w \cdot x)$ for every $x \in \mathbb{R}^d$.
We have the following result.

Theorem 11.8. When $\mathbb{C}$ is the space of homogeneous linear separators (with $d > 4$) and $\mathcal{P}$
is the uniform distribution on the surface of the origin-centered unit sphere in $\mathbb{R}^d$, when $\epsilon_t =
\epsilon > 0$ (constant) for all $t \in \mathbb{N}$, there is an algorithm that runs in time $\mathrm{poly}(d, 1/\epsilon)$ for each
prediction, which makes an expected number of mistakes among the first $T$ instances that is
$O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) T\right)$. Furthermore, the expected number of labels requested by the algorithm
among the first $T$ instances is $O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) T\right)$.

Before stating the proof, we have a few additional definitions and lemmas that will be needed.
For $\tau > 0$ and $x \in \mathbb{R}$, define $\ell_\tau(x) = \max\{0, 1 - x/\tau\}$. Consider the following algorithm and
subroutine; parameters $\delta_k$, $m_k$, $\tau_k$, $r_k$, $b_k$, $\alpha$, and $\kappa$ will all be specified below; we suppose
$M = \sum_{k=0}^{\lceil \log_2(1/\alpha) \rceil} m_k$.

Algorithm: DriftingHalfspaces

0. Let $w_0$ be an arbitrary element of $\mathbb{R}^d$ with $\|w_0\| = 1$
1. For $i = 1, 2, \ldots$
2.   ABL($M(i-1)$)

Subroutine: ModPerceptron($t$)

0. Let $w_t$ be any element of $\mathbb{R}^d$ with $\|w_t\| = 1$
1. For $m = t+1, t+2, \ldots, t+m_0$
2.   Predict $\hat{Y}_m = h_{w_{m-1}}(X_m)$ as the prediction for the value of $Y_m$
3.   Request the label $Y_m$
4.   If $\hat{Y}_m \neq Y_m$
5.     $w_m \leftarrow w_{m-1} - 2(w_{m-1} \cdot X_m)X_m$
6.   Else $w_m \leftarrow w_{m-1}$
7. Return $w_{t+m_0}$

Subroutine: ABL($t$)

0. Let $w_0$ be the return value of ModPerceptron($t$)
1. For $k = 1, 2, \ldots, \lceil \log_2(1/\alpha) \rceil$
2.   $W_k \leftarrow \{\}$
3.   For $s = t + \sum_{j=0}^{k-1} m_j + 1, \ldots, t + \sum_{j=0}^{k} m_j$
4.     Predict $\hat{Y}_s = h_{w_{k-1}}(X_s)$ as the prediction for the value of $Y_s$
5.     If $|w_{k-1} \cdot X_s| \leq b_{k-1}$, Request the label $Y_s$
6.       and let $W_k \leftarrow W_k \cup \{(X_s, Y_s)\}$
7.   Find $v_k \in \mathbb{R}^d$ with $\|v_k - w_{k-1}\| \leq r_k$, $0 < \|v_k\| \leq 1$, and
8.     $\sum_{(x,y) \in W_k} \ell_{\tau_k}(y(v_k \cdot x)) \leq \inf_{v : \|v - w_{k-1}\| \leq r_k} \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(v \cdot x)) + \kappa|W_k|$
9.   Let $w_k = \frac{1}{\|v_k\|} v_k$
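The update in Step 5 of ModPerceptron is a reflection of $w_{m-1}$ across the hyperplane orthogonal to $X_m$, so it preserves $\|w\|$ whenever $\|X_m\| = 1$. The following Python sketch of the subroutine is a minimal illustration, not the thesis's implementation: the helper names are hypothetical, the stream of labeled examples is passed in as a list rather than requested online, and $\mathrm{sign}(0)$ is taken to be $+1$ by assumption.

```python
import math

def sign(v):
    # Assumption: ties (v == 0) are classified as +1.
    return 1 if v >= 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mod_perceptron(w, stream):
    """Run the modified perceptron update over (x, y) pairs.

    w: unit-norm start vector; stream: iterable of (x, y) with
    unit-norm x and y in {-1, +1}. Returns (final w, mistake count).
    """
    mistakes = 0
    for x, y in stream:
        if sign(dot(w, x)) != y:
            mistakes += 1
            # w <- w - 2 (w . x) x : reflect w across the hyperplane
            # orthogonal to x, which flips the sign of w . x.
            scale = 2.0 * dot(w, x)
            w = [wi - scale * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

Because the update is norm-preserving on unit-norm inputs, no renormalization step is needed between rounds.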

The following result for ModPerceptron was proven by [Koby Crammer and Vaughan, 2010].

Lemma 11.9. Suppose $\epsilon < \frac{1}{512}$. Consider the values $w_m$ obtained during the execution of
ModPerceptron($t$). $\forall m \in \{t+1, \ldots, t+m_0\}$, $\mathcal{P}(x : h_{w_m}(x) \neq h^*_m(x)) \leq \mathcal{P}(x : h_{w_{m-1}}(x) \neq
h^*_m(x))$. Furthermore, letting $c_1 = \ldots$, if $\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) \geq 1/32$, then with
probability at least $1/64$, $\mathcal{P}(x : h_{w_m}(x) \neq h^*_m(x)) \leq (1 - c_1)\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x))$.

This  implies  the  following. 

Lemma  11.10.  Suppose  e  <  400^2 37(r  mo  =  max  j 1~512  In  >  r  128(1/ Ci)  ln(32)] 

wzY/z  probability  at  least  1  —  x/de,  Mo dPercep Iron ( 1 )  returns  a  vector  w  with  V(x  :  hw(x)  f 

K+mo+l(X))  <  VIS. 

Proof. By Lemma 11.9 and a union bound, in general we have

\[ \mathcal{P}(x : h_{w_m}(x) \neq h^*_{m+1}(x)) \leq \mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) + \epsilon. \tag{11.1} \]

Furthermore, if $\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) \geq 1/32$, then with probability at least $1/64$,

\[ \mathcal{P}(x : h_{w_m}(x) \neq h^*_{m+1}(x)) \leq (1 - c_1)\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) + \epsilon. \tag{11.2} \]

In particular, this implies that the number $N$ of values $m \in \{t+1, \ldots, t+m_0\}$ with either
$\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) < 1/32$ or $\mathcal{P}(x : h_{w_m}(x) \neq h^*_{m+1}(x)) \leq (1 - c_1)\mathcal{P}(x : h_{w_{m-1}}(x) \neq
h^*_m(x)) + \epsilon$ is lower-bounded by a Binomial($m_0$, $1/64$) random variable. Thus, a Chernoff bound
implies that with probability at least $1 - \exp\{-m_0/512\} \geq 1 - \sqrt{\epsilon d}$, we have $N \geq m_0/128$.
Suppose this happens.

Since $\epsilon m_0 \leq 1/32$, if any $m \in \{t+1, \ldots, t+m_0\}$ has $\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) < 1/32$,
then inductively applying (11.1) implies $\mathcal{P}(x : h_{w_{t+m_0}}(x) \neq h^*_{t+m_0+1}(x)) \leq 1/32 + \epsilon m_0 \leq 1/16$.
On the other hand, if all $m \in \{t+1, \ldots, t+m_0\}$ have $\mathcal{P}(x : h_{w_{m-1}}(x) \neq h^*_m(x)) \geq 1/32$, then
in particular we have $N$ values of $m \in \{t+1, \ldots, t+m_0\}$ satisfying (11.2). Combining this
fact with (11.1) inductively, we have that

\begin{align*}
\mathcal{P}(x : h_{w_{t+m_0}}(x) \neq h^*_{t+m_0+1}(x)) &\leq (1 - c_1)^{N}\,\mathcal{P}(x : h_{w_t}(x) \neq h^*_{t+1}(x)) + \epsilon m_0 \\
&\leq (1 - c_1)^{(1/c_1)\ln(32)}\,\mathcal{P}(x : h_{w_t}(x) \neq h^*_{t+1}(x)) + \epsilon m_0 \leq \frac{1}{32} + \epsilon m_0 \leq \frac{1}{16}.
\end{align*}

□
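The Chernoff step in the proof above, namely that a Binomial($m_0$, $1/64$) variable falls below $m_0/128$ with probability at most $\exp\{-m_0/512\}$, can be sanity-checked numerically. The Python sketch below (the function name is hypothetical, and the specific values of $m_0$ are arbitrary illustrations) computes the exact lower tail so that it can be compared against the exponential bound.

```python
import math

def binomial_lower_tail(n, p, threshold):
    """Exact P(X < threshold) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p ** j * (1 - p) ** (n - j)
        for j in range(math.ceil(threshold))  # j = 0, ..., ceil(threshold) - 1
    )

def chernoff_bound(m0):
    """The bound exp(-m0 / 512) used in the proof."""
    return math.exp(-m0 / 512)
```

For example, `binomial_lower_tail(512, 1/64, 4)` is the exact probability of fewer than four successes in 512 trials, which can be compared with `chernoff_bound(512)`.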
Next, we consider the execution of ABL($t$), and let the sets $W_k$ be as in that execution. We
will denote by $w^*$ the weight vector with $\|w^*\| = 1$ such that $h^*_{t+m_0+1} = h_{w^*}$. Also denote by
$M_1 = M - m_0$.

The  proof  relies  on  a  few  results  proven  in  the  work  of  [Awasthi,  Balcan,  and  Long,  2013], 
which  we  summarize  in  the  following  lemmas.  Although  the  results  were  proven  in  a  slightly 
different  setting  in  that  work  (namely,  agnostic  learning  under  a  fixed  joint  distribution),  one  can 
easily  verify  that  their  proofs  remain  valid  in  our  present  context  as  well. 

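Lemma 11.11. [Awasthi, Balcan, and Long, 2013] Fix any $k \in \{1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$. Suppose
$b_{k-1} = c_7 2^{1-k}/\sqrt{d}$ for a universal constant $c_7 > 0$, and let $z_k = \sqrt{r_k^2/(d-1) + b_{k-1}^2}$. For a
universal constant $c_1 > 0$, if $\|w^* - w_{k-1}\| \leq r_k$,

\[ \left| E\left[\sum_{(x,y) \in W_k} \ell_{\tau_k}(|w^* \cdot x|) \,\middle|\, w_{k-1}, |W_k|\right] - E\left[\sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) \,\middle|\, w_{k-1}, |W_k|\right] \right| \leq c_1 |W_k| \sqrt{2^k \epsilon M_1}\,\frac{z_k}{\tau_k}. \]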


Lemma 11.12. [Balcan and Long, 2013] For any $c > 0$, there is a constant $c' > 0$ depending
only on $c$ (i.e., not depending on $d$) such that, for any $u, v \in \mathbb{R}^d$ with $\|u\| = \|v\| = 1$, letting
$\Delta = \mathcal{P}(x : h_u(x) \neq h_v(x))$, if $\Delta < 1/2$, then

\[ \mathcal{P}\left(x : h_u(x) \neq h_v(x) \text{ and } |v \cdot x| \geq c'\frac{\Delta}{\sqrt{d}}\right) \leq c\Delta. \]

The following is a well-known lemma concerning concentration around the equator for the
uniform distribution (see e.g., [Awasthi, Balcan, and Long, 2013; Balcan, Broder, and Zhang,
2007b; Dasgupta, Kalai, and Monteleoni, 2009]); for instance, it easily follows from the formulas
for the area in a spherical cap derived by [Li, 2011].

Lemma 11.13. For any constant $C > 0$, there are constants $c_2, c_3 > 0$ depending only on $C$
(i.e., independent of $d$) such that, for any $w \in \mathbb{R}^d$ with $\|w\| = 1$, $\forall \gamma \in [0, C/\sqrt{d}]$,

\[ c_2 \gamma \sqrt{d} \leq \mathcal{P}(x : |w \cdot x| \leq \gamma) \leq c_3 \gamma \sqrt{d}. \]
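The $\gamma\sqrt{d}$ scaling in Lemma 11.13 can be checked numerically. For $x$ uniform on the unit sphere in $\mathbb{R}^d$ ($d \geq 3$), the projection $w \cdot x$ has density proportional to $(1 - u^2)^{(d-3)/2}$ on $[-1, 1]$, so the band mass is a ratio of two one-dimensional integrals. The Python sketch below is an illustration only: the function name `band_mass` and the midpoint-rule step count are assumptions of this sketch, and it does not identify the lemma's constants $c_2, c_3$.

```python
import math

def band_mass(d, gamma, steps=20000):
    """P(|w . x| <= gamma) for x uniform on the unit sphere in R^d, d >= 3.

    Uses midpoint-rule integration of the marginal density of w . x,
    which is proportional to (1 - u^2)^((d - 3) / 2) on [-1, 1].
    """
    power = (d - 3) / 2.0

    def integral(a, b):
        h = (b - a) / steps
        return h * sum(
            (1.0 - (a + (i + 0.5) * h) ** 2) ** power for i in range(steps)
        )

    return integral(-gamma, gamma) / integral(-1.0, 1.0)
```

Evaluating `band_mass(d, gamma) / (gamma * math.sqrt(d))` for several $d$ with $\gamma = C/\sqrt{d}$ gives ratios that stay within a fixed range as $d$ grows, in line with the lemma.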
Based on this lemma, [Awasthi, Balcan, and Long, 2013] prove the following.

Lemma 11.14. [Awasthi, Balcan, and Long, 2013] For $X \sim \mathcal{P}$, for any $w \in \mathbb{R}^d$ with $\|w\| = 1$,
for any $C > 0$ and $\tau, b \in [0, C/\sqrt{d}]$, for $c_2, c_3$ as in Lemma 11.13,

\[ E\left[\ell_\tau(|w^* \cdot X|) \,\middle|\, |w \cdot X| \leq b\right] \leq \frac{c_3 \tau}{c_2 b}. \]

The following is a slightly stronger version of a result of [Awasthi, Balcan, and Long, 2013]
(specifically, the size of $m_k$, and consequently the bound on $|W_k|$, are both improved by a factor
of $d$ compared to the original result).

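Lemma 11.15. Fix any $\delta \in (0, 1/e)$. For universal constants $c_4, c_5, c_6, c_7, c_8, c_9, c_{10} \in (0, \infty)$,
for an appropriate choice of $\kappa \in (0,1)$ (a universal constant), if $\alpha = c_9\sqrt{\epsilon d \log\left(\frac{1}{\kappa\delta}\right)}$, for
every $k \in \{1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$, if $b_{k-1} = c_7 2^{1-k}/\sqrt{d}$, $\tau_k = c_8 2^{-k}/\sqrt{d}$, $r_k = c_{10} 2^{-k}$, $\delta_k =
\delta/(\lceil \log_2(4/\alpha) \rceil - k)^2$, and $m_k = c_5 \frac{2^k}{\kappa^2} d \log\left(\frac{1}{\kappa\delta_k}\right)$, and if $\mathcal{P}(x : h_{w_{k-1}}(x) \neq h_{w^*}(x)) \leq
2^{-k-3}$, then with probability at least $1 - (4/3)\delta_k$, $|W_k| \leq c_6 \frac{1}{\kappa^2} d \log\left(\frac{1}{\kappa\delta_k}\right)$ and $\mathcal{P}(x : h_{w_k}(x) \neq
h_{w^*}(x)) \leq 2^{-k-4}$.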


Proof. By Lemma 11.13, and a Chernoff and union bound, for an appropriately large choice of
$c_5$ and any $c_7 > 0$, letting $c_2, c_3$ be as in Lemma 11.13 (with $C = c_7 \vee (c_8/2)$), with probability
at least $1 - \delta_k/3$,

\[ c_2 c_7 2^{-k} m_k \leq |W_k| \leq 4 c_3 c_7 2^{-k} m_k. \tag{11.3} \]

The claimed upper bound on $|W_k|$ follows from this second inequality.

Next note that, if $\mathcal{P}(x : h_{w_{k-1}}(x) \neq h_{w^*}(x)) \leq 2^{-k-3}$, then

\[ \max\left\{\ell_{\tau_k}(y(w^* \cdot x)) : x \in \mathbb{R}^d, |w_{k-1} \cdot x| \leq b_{k-1}, y \in \{-1, +1\}\right\} \leq c_{11}\sqrt{d} \]

for some universal constant $c_{11} > 0$. Furthermore, since $\mathcal{P}(x : h_{w_{k-1}}(x) \neq h_{w^*}(x)) \leq 2^{-k-3}$,
we know that the angle between $w_{k-1}$ and $w^*$ is at most $2^{-k-3}\pi$, so that

\begin{align*}
\|w_{k-1} - w^*\| &= \sqrt{2 - 2 w_{k-1} \cdot w^*} \leq \sqrt{2 - 2\cos(2^{-k-3}\pi)} \\
&\leq \sqrt{2 - 2\cos^2(2^{-k-3}\pi)} = \sqrt{2}\sin(2^{-k-3}\pi) \leq 2^{-k-3}\pi\sqrt{2}.
\end{align*}

For $c_{10} = \pi\sqrt{2}\,2^{-3}$, this is $r_k$. By Hoeffding's inequality (under the conditional distribution given
$|W_k|$), the law of total probability, Lemma 11.11, and linearity of conditional expectations, with
probability at least $1 - \delta_k/3$, for $X \sim \mathcal{P}$,

\begin{align*}
\sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) \leq{}& |W_k|\,E\left[\ell_{\tau_k}(|w^* \cdot X|) \,\middle|\, w_{k-1}, |w_{k-1} \cdot X| \leq b_{k-1}\right] \\
&+ c_1 |W_k| \sqrt{2^k \epsilon M_1}\,\frac{z_k}{\tau_k} + \sqrt{|W_k| (1/2) c_{11}^2 d \ln(3/\delta_k)}. \tag{11.4}
\end{align*}

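We bound each term on the right hand side separately. By Lemma 11.14, the first term is at most
$\frac{c_3 \tau_k}{c_2 b_{k-1}}|W_k| = \frac{c_3 c_8}{2 c_2 c_7}|W_k|$. Next,

\[ \frac{z_k}{\tau_k} = \frac{\sqrt{c_{10}^2 2^{-2k}/(d-1) + 4 c_7^2 2^{-2k}/d}}{c_8 2^{-k}/\sqrt{d}} \leq \frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8}, \]

while $2^k \leq 2/\alpha$ so that the second term is at most

\[ \sqrt{2}\,c_1\,\frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8}\,|W_k|\,\sqrt{\frac{\epsilon M_1}{\alpha}}. \]

Noting that

\[ M_1 = \sum_{k'=1}^{\lceil \log_2(1/\alpha) \rceil} m_{k'} \leq \frac{32 c_5}{\kappa^2}\,\frac{1}{\alpha}\,d \log\left(\frac{1}{\kappa\delta}\right), \tag{11.5} \]

we find that the second term on the right hand side of (11.4) is at most

\[ \frac{8 c_1 \sqrt{c_5}}{\kappa}\,\frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8}\,|W_k|\,\frac{\sqrt{\epsilon d \log\left(\frac{1}{\kappa\delta}\right)}}{\alpha} = \frac{8 c_1 \sqrt{c_5}}{\kappa}\,\frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8 c_9}\,|W_k|. \]

Finally, since $d\ln(3/\delta_k) \leq 2d\ln(1/\delta_k) \leq \frac{2\kappa^2}{c_5} 2^{-k} m_k$, and (11.3) implies $2^{-k} m_k \leq \frac{1}{c_2 c_7}|W_k|$, the
third term on the right hand side of (11.4) is at most

\[ |W_k|\,\frac{c_{11}\kappa}{\sqrt{c_2 c_5 c_7}}. \]

Altogether, we have

\[ \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) \leq |W_k| \left( \frac{c_3 c_8}{2 c_2 c_7} + \frac{8 c_1 \sqrt{c_5}}{\kappa}\,\frac{\sqrt{2 c_{10}^2 + 4 c_7^2}}{c_8 c_9} + \frac{c_{11}\kappa}{\sqrt{c_2 c_5 c_7}} \right). \]

Taking $c_9 = 1/\kappa^3$ and $c_8 = \kappa$, this is at most

\[ \kappa |W_k| \left( \frac{c_3}{2 c_2 c_7} + 8 c_1 \sqrt{c_5} \sqrt{2 c_{10}^2 + 4 c_7^2} + \frac{c_{11}}{\sqrt{c_2 c_5 c_7}} \right). \]

Next, note that because $h_{w_k}(x) \neq y \Rightarrow \ell_{\tau_k}(y(v_k \cdot x)) \geq 1$, and because (as proven above)
$\|w^* - w_{k-1}\| \leq r_k$,

\[ |W_k|\,\mathrm{er}_{W_k}(h_{w_k}) \leq \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(v_k \cdot x)) \leq \sum_{(x,y) \in W_k} \ell_{\tau_k}(y(w^* \cdot x)) + \kappa |W_k|. \]

Combined with the above, we have

\[ |W_k|\,\mathrm{er}_{W_k}(h_{w_k}) \leq \kappa |W_k| \left( 1 + \frac{c_3}{2 c_2 c_7} + 8 c_1 \sqrt{c_5} \sqrt{2 c_{10}^2 + 4 c_7^2} + \frac{c_{11}}{\sqrt{c_2 c_5 c_7}} \right). \]

Let $c_{12} = 1 + \frac{c_3}{2 c_2 c_7} + 8 c_1 \sqrt{c_5} \sqrt{2 c_{10}^2 + 4 c_7^2} + \frac{c_{11}}{\sqrt{c_2 c_5 c_7}}$. Furthermore,

\begin{align*}
|W_k|\,\mathrm{er}_{W_k}(h_{w_k}) &= \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq y] \\
&\geq \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] - \sum_{(x,y) \in W_k} \mathbb{1}[h_{w^*}(x) \neq y].
\end{align*}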

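For an appropriately large value of $c_5$, by a Chernoff bound, with probability at least $1 - \delta_k/3$,

\[ \sum_{s = t + \sum_{j=0}^{k-1} m_j + 1}^{t + \sum_{j=0}^{k} m_j} \mathbb{1}[h_{w^*}(X_s) \neq Y_s] \leq 2e\epsilon M_1 m_k + \log_2(3/\delta_k). \]

In particular, this implies

\[ \sum_{(x,y) \in W_k} \mathbb{1}[h_{w^*}(x) \neq y] \leq 2e\epsilon M_1 m_k + \log_2(3/\delta_k), \]

so that

\[ \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] \leq |W_k|\,\mathrm{er}_{W_k}(h_{w_k}) + 2e\epsilon M_1 m_k + \log_2(3/\delta_k). \]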


iNoting  tnat  (it. 


, ,  32c5 

eMiink  <  e — - 

Kz 


dl°g(^) 


_L\  C2C7 


\Wk\  < 


32c>, 


C2C7CghC- 


\Tlog(£) 


K*  Cg^/edlog  (Y) 


2k\Wk 


32c; 


c2c7cln2 


a2k\Wk 


C2C7 


C2C7 
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and  (11.3)  implies  log2(3/<5fc)  <  2k  \Wk\,  altogether  we  have 

C“2  C5  C'j 

V  I [hn(x)  7  hw.(x)\  <  \Wk\erwk(hWk)  +  +  ^—\Wk\ 

(«-»)€, n  C2C 7  W7 

.  .  (  64ec5K3  2k  \ 

<  K\Wk\  C12  +  - —  +  -  . 

V  C2C7  C2C5C7/ 

Letting  ci3  =  c12  +  ^  and  noting  re  <  1,  we  have  E(x,y)eWk  l[Kk{x)  ±  hw*(x)\  < 

Cl3«|Wfc|. 

Lemma 11.1 (applied under the conditional distribution given $|W_k|$) and the law of total
probability imply that with probability at least $1 - \delta_k/3$,

\begin{align*}
&|W_k|\,\mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\middle|\, |w_{k-1} \cdot x| \leq b_{k-1}\right) \\
&\quad \leq \sum_{(x,y) \in W_k} \mathbb{1}[h_{w_k}(x) \neq h_{w^*}(x)] + c_{14}\sqrt{|W_k| \left(d \log(|W_k|/d) + \log(1/\delta_k)\right)},
\end{align*}

for a universal constant $c_{14} > 0$. Combined with the above, and the fact that (11.3) implies


log(l/4)  <  7^77 1^4 1  and 


dlog(|Wfc|/d)  <  dlog 


8C3C5C7  log 


<  dlog  -  3  log(8  max{c3,  l}c5)c5(ilog 


<  31og(8max{c3,  1})kt2  krnk  < 


2r\—k  31og(8max{c3,l})  2 


K2\Wk  I, 


we  have 


\Wk\V  (a;  :  hWk(x)  7^  hw*  (x)  |wfe_i  •  x\  < 


<  c^Wt\  +  cJm  ( 31°g(8ma,fa.l})tt  +  _5L|wy 

V  V  c2c7  C2C5C7 


,w  1  ,  /  3  log  (8  max{c3, 1})  1 

—  Ci3  +  Ci4-t/ - 1 - 

,  V  C2C7  C2C5C7 


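Thus, letting $c_{15} = c_{13} + c_{14}\sqrt{\frac{3\log(8\max\{c_3, 1\})}{c_2 c_7} + \frac{1}{c_2 c_5 c_7}}$, we have

\[ \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\middle|\, |w_{k-1} \cdot x| \leq b_{k-1}\right) \leq c_{15}\kappa. \tag{11.6} \]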




Next, note that $\|v_k - w_{k-1}\| = \sqrt{\|v_k\|^2 + 1 - 2\|v_k\|\cos(\pi\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)))}$. Thus,
one implication of the fact that $\|v_k - w_{k-1}\| \leq r_k$ is that $\frac{\|v_k\|^2 + 1 - r_k^2}{2\|v_k\|} \leq \cos(\pi\mathcal{P}(x : h_{w_k}(x) \neq
h_{w_{k-1}}(x)))$; since the left hand side is positive, we have $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) < 1/2$. Additionally,
by differentiating, one can easily verify that for $\phi \in [0, \pi]$, $x \mapsto \sqrt{x^2 + 1 - 2x\cos(\phi)}$ is
minimized at $x = \cos(\phi)$, in which case $\sqrt{x^2 + 1 - 2x\cos(\phi)} = \sin(\phi)$. Thus, $\|v_k - w_{k-1}\| \geq
\sin(\pi\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)))$. Since $\|v_k - w_{k-1}\| \leq r_k$, we have $\sin(\pi\mathcal{P}(x : h_{w_k}(x) \neq
h_{w_{k-1}}(x))) \leq r_k$. Since $\sin(\pi x) \geq x$ for all $x \in [0, 1/2]$, combining this with the fact (proven
above) that $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) < 1/2$ implies $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) \leq r_k$.
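The two elementary facts used in this paragraph can be checked numerically. The Python sketch below is only an illustration with hypothetical function names: it minimizes $x \mapsto \sqrt{x^2 + 1 - 2x\cos(\phi)}$ over a grid on $[0, 1]$ for angles $\phi \leq \pi/2$ (where $\cos(\phi) \geq 0$, matching the case $\mathcal{P}(\cdot) < 1/2$ above), so the minimizer and minimum can be compared with $\cos(\phi)$ and $\sin(\phi)$.

```python
import math

def chord_length(x, phi):
    """Distance between a vector of norm x and a unit vector at angle phi."""
    return math.sqrt(x * x + 1.0 - 2.0 * x * math.cos(phi))

def min_chord(phi, grid=10001):
    """Grid-minimize chord_length over x in [0, 1]; returns (argmin, min)."""
    xs = [i / (grid - 1) for i in range(grid)]
    best = min(xs, key=lambda x: chord_length(x, phi))
    return best, chord_length(best, phi)

def sin_lower_bound_holds(points=501):
    """Check sin(pi * x) >= x on a grid of [0, 1/2]."""
    return all(
        math.sin(math.pi * i / (2 * (points - 1))) >= i / (2 * (points - 1))
        for i in range(points)
    )
```

This matches the law-of-cosines identity: $x^2 + 1 - 2x\cos(\phi) = (x - \cos(\phi))^2 + \sin^2(\phi)$.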

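In particular, we have that both $\mathcal{P}(x : h_{w_k}(x) \neq h_{w_{k-1}}(x)) \leq r_k$ and $\mathcal{P}(x : h_{w^*}(x) \neq
h_{w_{k-1}}(x)) \leq 2^{-k-3} \leq r_k$. Now Lemma 11.12 implies that, for any universal constant $c > 0$,
there exists a corresponding universal constant $c' > 0$ such that

\[ \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \geq c'\frac{r_k}{\sqrt{d}}\right) \leq c r_k, \]

and

\[ \mathcal{P}\left(x : h_{w^*}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \geq c'\frac{r_k}{\sqrt{d}}\right) \leq c r_k, \]

so that (by a union bound)

\begin{align*}
&\mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \text{ and } |w_{k-1} \cdot x| \geq c'\frac{r_k}{\sqrt{d}}\right) \\
&\quad \leq \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \geq c'\frac{r_k}{\sqrt{d}}\right) \\
&\qquad + \mathcal{P}\left(x : h_{w^*}(x) \neq h_{w_{k-1}}(x) \text{ and } |w_{k-1} \cdot x| \geq c'\frac{r_k}{\sqrt{d}}\right) \leq 2 c r_k.
\end{align*}

In particular, letting $c_7 = c' c_{10}/2$, we have $c'\frac{r_k}{\sqrt{d}} = b_{k-1}$. Combining this with (11.6), Lemma 11.13,
and a union bound, we have that

\begin{align*}
&\mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x)) \\
&\quad \leq \mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x) \text{ and } |w_{k-1} \cdot x| \geq b_{k-1}) + \mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x) \text{ and } |w_{k-1} \cdot x| \leq b_{k-1}) \\
&\quad \leq 2 c r_k + \mathcal{P}\left(x : h_{w_k}(x) \neq h_{w^*}(x) \,\middle|\, |w_{k-1} \cdot x| \leq b_{k-1}\right) \mathcal{P}(x : |w_{k-1} \cdot x| \leq b_{k-1}) \\
&\quad \leq 2 c r_k + c_{15}\kappa c_3 b_{k-1}\sqrt{d} = \left(2^5 c\,c_{10} + c_{15}\kappa c_3 c_7 2^5\right) 2^{-k-4}.
\end{align*}

Taking $c = \frac{1}{2^6 c_{10}}$ and $\kappa = \frac{1}{2^6 c_3 c_7 c_{15}}$, we have $\mathcal{P}(x : h_{w_k}(x) \neq h_{w^*}(x)) \leq 2^{-k-4}$, as required.

By a union bound, this occurs with probability at least $1 - (4/3)\delta_k$. □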

Proof  of  Theorem  11.8.  If  e  >  ^^7^,  the  result  trivially  holds,  since  then  T  <  40°f 27  yfedT. 

2 

Otherwise,  suppose  e  <  _wf227d  • 

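Fix any $i \in \mathbb{N}$. Lemma 11.10 implies that, with probability at least $1 - \sqrt{\epsilon d}$, the $w_0$ returned
in Step 0 of ABL($M(i-1)$) satisfies $\mathcal{P}(x : h_{w_0}(x) \neq h^*_{M(i-1)+m_0+1}(x)) \leq 1/16$. Taking this as
a base case, Lemma 11.15 (with $\delta = \sqrt{\epsilon d}$) then inductively implies that, with probability at least

\[ 1 - \sqrt{\epsilon d} - \sum_{k=1}^{\lceil \log_2(1/\alpha) \rceil} (4/3)\frac{\sqrt{\epsilon d}}{(\lceil \log_2(4/\alpha) \rceil - k)^2} \geq 1 - \sqrt{\epsilon d}\left(1 + (4/3)\sum_{\ell=2}^{\infty} \frac{1}{\ell^2}\right) \geq 1 - 2\sqrt{\epsilon d}, \]

every $k \in \{0, 1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$ has

\[ \mathcal{P}(x : h_{w_k}(x) \neq h^*_{M(i-1)+m_0+1}(x)) \leq 2^{-k-4}, \tag{11.7} \]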

and  furthermore  the  number  of  labels  requested  during  ABL (M(i  —  1))  total  to  at  most  (for 
appropriate  universal  constants  ci,  c2) 

riog2(l/a)l 


m0 


g2(1/«)l  /  /i  \  [log2(l/a)l  /  /  r,  /,  ,  x-i  r  \  2  \  \ 


<  c2d  log2  (  — 
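In particular, by a union bound, (11.7) implies that for every $k \in \{1, \ldots, \lceil \log_2(1/\alpha) \rceil\}$, every
$m \in \left\{M(i-1) + \sum_{j=0}^{k-1} m_j + 1, \ldots, M(i-1) + \sum_{j=0}^{k} m_j\right\}$ has

\begin{align*}
&\mathcal{P}(x : h_{w_{k-1}}(x) \neq h^*_m(x)) \\
&\quad \leq \mathcal{P}(x : h_{w_{k-1}}(x) \neq h^*_{M(i-1)+m_0+1}(x)) + \mathcal{P}(x : h^*_{M(i-1)+m_0+1}(x) \neq h^*_m(x)) \\
&\quad \leq 2^{-k-3} + \epsilon M.
\end{align*}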


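Thus, noting that

\begin{align*}
M = \sum_{k=0}^{\lceil \log_2(1/\alpha) \rceil} m_k &= \Theta\left(d + \log\left(\frac{1}{\epsilon d}\right) + \sum_{k=1}^{\lceil \log_2(1/\alpha) \rceil} 2^k d \log\left(\frac{1}{\epsilon d}\right)\right) \\
&= \Theta\left(\frac{1}{\alpha} d \log\left(\frac{1}{\epsilon d}\right)\right) = \Theta\left(\sqrt{\frac{d}{\epsilon}}\sqrt{\log\left(\frac{1}{\epsilon d}\right)}\right),
\end{align*}

we have that the expected number of labels requested among $\{Y_{M(i-1)+1}, \ldots, Y_{Mi}\}$ is at most

\[ c_2 d \log^2\left(\frac{1}{\epsilon d}\right) + 2\sqrt{\epsilon d}\,M = O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) M\right), \]

and the expected number of mistaken predictions among points $\{X_{M(i-1)+1}, \ldots, X_{Mi}\}$ is at most

\begin{align*}
&2\sqrt{\epsilon d}\,M + \left(1 - 2\sqrt{\epsilon d}\right)\left(m_0 + \sum_{k=1}^{\lceil \log_2(1/\alpha) \rceil} \left(2^{-k-3} + \epsilon M\right) m_k\right) \\
&\quad = O\left(\sqrt{\epsilon d}\,M + d \log^2\left(\frac{1}{\epsilon d}\right) + \epsilon M^2\right) = O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) M\right).
\end{align*}

These imply that the expected number of labels requested among $\{Y_1, \ldots, Y_T\}$, for any given
$T$, is at most

\[ O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) M \left\lceil \frac{T}{M} \right\rceil\right) = O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) T\right), \]

and the expected number of mistaken predictions among points $\{X_1, \ldots, X_T\}$ is at most

\[ O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) M \left\lceil \frac{T}{M} \right\rceil\right) = O\left(\sqrt{\epsilon d}\,\log^{3/2}\left(\frac{1}{\epsilon d}\right) T\right). \]

□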
Remark: The original work of [Koby Crammer and Vaughan, 2010] additionally allowed for
some number $K$ of “jumps”: times $t$ at which $\epsilon_t = 1$. Note that, in the above algorithm, since the
influence of each sample is localized to the predictors trained within that “batch” of $M$ instances,
the effect of allowing such jumps would only change the bound on the number of mistakes to
0  ^ sfdeT  +  j .  This  compares  favorably  to  the  result  of  [Koby  Crammer  and  Vaughan, 
2010],  which  is  roughly  O  ^[de)l/AT  +  p^A'j.  However,  the  result  of  [Koby  Crammer  and 
Vaughan,  2010]  was  proven  for  a  slightly  more  general  setting,  allowing  distributions  V  that 
are  not  quite  uniform  (though  they  do  require  a  relation  between  the  angle  between  any  two 
separators  and  the  probability  mass  they  disagree  on,  similar  to  that  holding  for  the  uniform 
distribution,  which  seems  to  require  the  distributions  are  not  too  far  from  uniform).  It  is  not  clear 
whether  Theorem  11.8  can  be  generalized  to  this  larger  family  of  distributions. 

11.6  General  Analysis  of  Sublinear  Mistake  Bounds:  Passive 
Learning 

First,  consider  the  following  general  lemma. 

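Lemma 11.16. Suppose $\epsilon_t \to 0$. Then there exists an increasing sequence $\{T_i\}_{i=1}^{\infty}$ in $\mathbb{N}$ with
$T_1 = 1$ such that $\lim_{i \to \infty} T_{i+1} - T_i = \infty$ while $\lim_{i \to \infty} \sum_{t=T_i}^{T_{i+1}-1} \epsilon_t = 0$.

Proof. Let $T_1 = 1$, $T_2 = 2$, and $\gamma_2 = \epsilon_1$. Inductively, for each $i > 2$, if $\sum_{t=T_{i-1}}^{T_{i-1}+2(T_{i-1}-T_{i-2})-1} \epsilon_t \leq
\gamma_{i-1}/2$, set $T_i = T_{i-1} + 2(T_{i-1} - T_{i-2})$ and $\gamma_i = \sum_{t=T_{i-1}}^{T_i - 1} \epsilon_t$; otherwise, set $T_i = T_{i-1} + (T_{i-1} -
T_{i-2})$ and $\gamma_i = \gamma_{i-1}$. Since any fixed value $M \in \mathbb{N}$ has $\lim_{T \to \infty} \sum_{t=T}^{T+M} \epsilon_t = 0$, we know there
exist an infinite number of values $i \in \mathbb{N}$ with $\gamma_i \leq \gamma_{i-1}/2$, at which point we then also have
$T_i - T_{i-1} = 2(T_{i-1} - T_{i-2}) > T_{i-1} - T_{i-2}$; together these facts imply the stated properties. □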

Suppose $\mathbb{C}$ is the concept space, and that $\mathbb{C}$ has finite VC dimension $d$. Consider the following
passive learning algorithm, based on the sequence $T_i$ implied by Lemma 11.16.

0. Let $\hat{h}_1$ be any element of $\mathbb{C}$
1. For $i = 1, 2, \ldots$
2.   For $t = T_i, \ldots, T_{i+1} - 1$
3.     Predict $\hat{Y}_t = \hat{h}_i(X_t)$ as the prediction for the value of $Y_t$
4.   Let $\hat{h}_{i+1} = \mathrm{ERM}(\mathbb{C}, \{(X_{T_i}, Y_{T_i}), \ldots, (X_{T_{i+1}-1}, Y_{T_{i+1}-1})\})$
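The block structure of the algorithm above can be sketched in Python for a toy concept class. Everything specific here is an assumption made for illustration: the class is one-dimensional thresholds $h_z(x) = +1$ iff $x \geq z$ (not a class used in this section), the names `erm_threshold` and `blockwise_erm` are hypothetical, and the block boundaries $T_1, T_2, \ldots$ are passed in rather than constructed as in Lemma 11.16.

```python
def erm_threshold(samples):
    """ERM over 1-D thresholds h_z(x) = +1 iff x >= z.

    samples: list of (x, y) with y in {-1, +1}. Candidate thresholds
    at -inf, each sample point, and +inf realize every labeling a
    threshold can produce on the sample.
    """
    candidates = [float("-inf")] + sorted(x for x, _ in samples)
    candidates.append(float("inf"))

    def errors(z):
        return sum((1 if x >= z else -1) != y for x, y in samples)

    return min(candidates, key=errors)

def blockwise_erm(stream, boundaries, z0=0.0):
    """Predict on block i with the current threshold, then refit on block i.

    stream: list of (x, y); boundaries: [T_1, T_2, ...] (1-indexed),
    so block i covers times T_i, ..., T_{i+1} - 1.
    Returns (final threshold, total number of mistakes).
    """
    z, mistakes = z0, 0
    for start, end in zip(boundaries, boundaries[1:]):
        block = stream[start - 1:end - 1]
        mistakes += sum((1 if x >= z else -1) != y for x, y in block)
        z = erm_threshold(block)  # trained only on the block just seen
    return z, mistakes
```

Each classifier is trained only on the block immediately preceding the one it predicts on, so stale examples from earlier blocks never influence later predictions.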

Theorem 11.17. If $\epsilon_t \to 0$, and $\{T_i\}_{i=1}^{\infty}$ is the sequence guaranteed to exist by Lemma 11.16,
then the above algorithm has an expected cumulative number of mistakes $o(T)$.


Proof. Consider any value $i \in \mathbb{N}$, and let $\bar{h}_{i+1} = h^*_{T_{i+1}}$. By a Chernoff bound, with probability
at least $1 - 1/(T_{i+1} - T_i)$,

\[ \sum_{t=T_i}^{T_{i+1}-1} \mathbb{1}[\bar{h}_{i+1}(X_t) \neq h^*_t(X_t)] \leq \log_2(T_{i+1} - T_i) + 2e \sum_{t=T_i+1}^{T_{i+1}} \sum_{k=t}^{T_{i+1}} \epsilon_k. \]

Furthermore, standard VC analysis implies that, with probability at least $1 - 1/(T_{i+1} - T_i)$,
$\forall h, g \in \mathbb{C}$,

\[ \sum_{t=T_i}^{T_{i+1}-1} \mathbb{1}[h(X_t) \neq g(X_t)] \geq (T_{i+1} - T_i)\mathcal{P}(x : h(x) \neq g(x)) - c\sqrt{(d \log(T_{i+1} - T_i))(T_{i+1} - T_i)}, \]

for  some  numerical  constant  c  >  0.  Thus,  on  these  events,  any  h  e  C  with  V ( x  :  h(x)  f 


hi+i(x))  >  2 


iog2(Ti+1-ri)+2eEtf^+i  EfcS1 

Ti+ 1  —  Tj 


+  C 


dlog(Ti+i-ri) 
Fi+i  Ft 


must  have 


Fi+i— l 

X  i{h(xt)  ±  huxt)] 

t=Tt 

Fs+i  — 1  T.i+i—  1 

>  53  I[ft(A,)  #  fci+1(A,)]  -  53  I[A,+1(A'i)#/i;(.Y,)] 

t=Tf  t=Tj 

Ti+ 1  Fj+i 

>  log2(Tj+i  -  Ti )  +  2e  £ 

t=Tj+l  /c=£ 

^i+i  — 1 

>  53  l[hi+l(Xt)  ^  h‘(Xt)} 

t=Tt 
Ti+i  —  1 

t=Ti 
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Therefore,  by  a  union  bound,  with  probability  at  least  1  —  2/(Ti+1  —  7) ) , 


:  ft++i(z)  7^  hi+1(x))  <  2 


log2(Tj 


i+l 


Ti)  +  2e  Yjt=Ti+ 1  efc 


i+l 


T, 


+  C-t 


/dlog^+i-Ti) 


T, 


i+l 


T, 


so  that 


E 


"P(a;  :  hi+1(x)  7^  hi+1(x)) 


log 2<Ti+1  -  T,)  +  2e  E2,+t,+i  ES‘  <* 


■'T4 


<  2 


T 


i+l 


T; 


+  C\ 


'd  log  (7) 


i+l 


TV 


T, 


i+l 


T 


+ 


T 


i+l 


T 


Denote  by  pl+\  the  value  on  the  right  hand  side  of  this  inequality.  Since  Tl+\  —  T,:  — )■  00 
and  T  i‘_T  Ym,=t%+\  J2k=t  ek  <  Ylt=Ti+iet  0  (guaranteed  by  Lemma  11.16),  we  have 
linii-KX) Pi+i  =  0.  Since  E[^."1I[/i,+i(Xt)  ^  h*(Xt)}}  <  ESt^+i  EjUri+1+i  efc,  we 
have 


E 


T+ 2  —  1 

E  i[Li(+)?^  !>;(+)] 


*=T+i 


Ti+2  — 1 

Ti+2—  1 

<  E 

X]  l[hi+1(Xt)  ^  hi+1(Xt)} 

+  E 

t=Ti+ 1 

t=Ti+ 1 

T+2-1  t 

<  (Ti+ 2  -  T?:+i)E[T,(a;  :  ^+i(x)  7^  /ii+i(x))]  +  E  E  * 

t—Ti- pi+i  fc— Ti_|_i+1 


?i+2  —  1  t 

<  (^i+2  —  Ti+i)Pi+i  +  E  E 

t—Ti- i-i+l  A:— 7j_|_i+l 


Since  pi+i  -+  0,  we  have  ( Ti+2  -  Ti+1)pi+1  =  o(Ti+2  -  Ti+1),  and  since  Ti+2  -  Ti+1  -+  00, 
we  have  ^=1(Ti+2-Ti+i)pi+i  =  o(Ti).  Furthermore,  since  Y^.=p+[+\  EjUt4+1+i  efc  <  (?i+2- 
Ti+ 1)  ESt^+i  e*  =  o(Tj:+2-ri+i),  and  Ti+2-Ti+1  -+  00,  we  have  ELi  ESt+i+i  ELt4+1+i 
o{Tj).  Altogether,  we  have  that  the  expected  sum  of  mistakes  up  to  time  T  (which  is  the  sum 
of  the  expected  numbers  of  mistakes  within  the  component  segments  Ti+ 1, . . . ,  Ti+2  —  1)  grows 
sublinearly  in  T.  □ 
11.7  General Analysis under Varying Drift Rate: Inefficient
Passive Learning

Consider the following algorithm.

0. For $T = 1, 2, \ldots$
1.   Let $m_T = \operatorname{argmin}_{m \in \{1, \ldots, T-1\}} \sum_{t=T-m+1}^{T} \epsilon_t + \frac{d \log(m/d)}{m}$
2.   Let $\hat{h}_T = \mathrm{ERM}(\mathbb{C}, \{(X_{T-m_T}, Y_{T-m_T}), \ldots, (X_{T-1}, Y_{T-1})\})$
3.   Predict $\hat{Y}_T = \hat{h}_T(X_T)$ as the prediction for the value of $Y_T$
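The window-selection rule in Step 1 trades accumulated drift against estimation error. A minimal Python sketch follows, under assumptions that are not from the text: the function name `choose_window` is hypothetical, the drift rates are supplied as a list with `eps[t - 1]` holding $\epsilon_t$, and the logarithm is clamped below at 1 so that the second term stays positive when $m$ is small relative to $d$.

```python
import math

def choose_window(eps, T, d):
    """Pick m in {1, ..., T-1} minimizing
    sum_{t=T-m+1}^{T} eps_t + d * log(m / d) / m.

    eps[t - 1] holds eps_t, so len(eps) must be at least T.
    Assumption of this sketch: log(m / d) is clamped below at 1.
    """
    best_m, best_val = None, float("inf")
    drift = 0.0
    for m in range(1, T):
        drift += eps[T - m]  # adds eps_{T-m+1} to the running drift sum
        val = drift + d * max(1.0, math.log(m / d)) / m
        if val < best_val:
            best_m, best_val = m, val
    return best_m
```

With no drift the rule uses the longest available window, while with large drift it shrinks the window to the most recent example.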


Theorem 11.18. The above algorithm makes an expected number of mistakes among the first $T$
instances that is

Proof. It suffices to show that, for any $T \in \mathbb{N}$, and any $m \in \{1, \ldots, T-1\}$, the classifier
$\hat{h} = \mathrm{ERM}(\mathbb{C}, \{(X_{T-m}, Y_{T-m}), \ldots, (X_{T-1}, Y_{T-1})\})$ has

for some universal constant $c' \in (0, \infty)$. Minimization over $m$ in the theorem statement then
follows from the fact that $m_T$ minimizes this expression over $m$ by definition. The result will
then follow by linearity of expectations.

Let $\bar{\epsilon} = \sum_{t=T-m+1}^{T} \epsilon_t$. By a Chernoff bound, with probability at least $1 - \delta$,

By Lemma 11.1, on an additional event of probability at least $1 - \delta$,

for an appropriate numerical constant $c'' \in [1, \infty)$. Taking $\delta = d/m$, this is at most

Since this holds with probability $1 - 2\delta = 1 - 2d/m$, and $\mathcal{P}(x : \hat{h}(x) \neq h^*_{T-m}(x)) \leq 1$ always,
we have
In particular, we have the following corollary.

Corollary 11.19. If $\sum_{t=1}^{T} \epsilon_t = o(T)$, then the expected number of mistakes made by the above
algorithm is also $o(T)$.

Proof. Let $\beta_t(m) = \max$

so that Theorem 11.18 (combined with the fact that the probability of a mistake on a given round
is at most 1) implies the expected number of mistakes is $O\left(\sum_{t=1}^{T} \min_{m \in \{1, \ldots, t-1\}} \beta_t(m) \wedge 1\right)$. Let
$m'_t = \operatorname{argmin}$

Fix  any  M  e  N.  For  a  given  t,  if  m't  <  M,  then  it  must  be  that  J2l=t-M+ i  . 
we have that
Since this is true of any $M \in \mathbb{N}$, we have that
so that the expected number of mistakes is $o(T)$, as claimed.

Chapter 12


Surrogate  Losses  in  Passive  and  Active 
Learning 


Abstract 

1  Active  learning  is  a  type  of  sequential  design  for  supervised  machine  learning,  in  which  the 
learning  algorithm  sequentially  requests  the  labels  of  selected  instances  from  a  large  pool  of 
unlabeled  data  points.  The  objective  is  to  produce  a  classifier  of  relatively  low  risk,  as  measured 
under  the  0-1  loss,  ideally  using  fewer  label  requests  than  the  number  of  random  labeled  data 
points  sufficient  to  achieve  the  same.  This  work  investigates  the  potential  uses  of  surrogate  loss 
functions  in  the  context  of  active  learning.  Specifically,  it  presents  an  active  learning  algorithm 
based  on  an  arbitrary  classification-calibrated  surrogate  loss  function,  along  with  an  analysis  of 
the  number  of  label  requests  sufficient  for  the  classifier  returned  by  the  algorithm  to  achieve  a 
given  risk  under  the  0-1  loss.  Interestingly,  these  results  cannot  be  obtained  by  simply  optimizing 
the  surrogate  risk  via  active  learning  to  an  extent  sufficient  to  provide  a  guarantee  on  the  0-1  loss, 
as  is  common  practice  in  the  analysis  of  surrogate  losses  for  passive  learning.  Some  of  the  results 
have  additional  implications  for  the  use  of  surrogate  losses  in  passive  learning. 

1 The chapter is based on joint work with Steve Hanneke.

12.1 Introduction


In  supervised  machine  learning,  we  are  tasked  with  learning  a  classifier  whose  probability  of 
making  a  mistake  (i.e.,  error  rate)  is  small.  The  study  of  when  it  is  possible  to  learn  an  accurate 
classifier  via  a  computationally  efficient  algorithm,  and  how  to  go  about  doing  so,  is  a  subtle  and 
difficult  topic,  owing  largely  to  nonconvexity  of  the  loss  function:  namely,  the  0-1  loss.  While 
there is certainly an active literature on developing computationally efficient methods that succeed
at this task, even under various noise conditions, it seems fair to say that at present, many
of these advances have not yet reached the level of robustness, efficiency, and simplicity required
for most applications. In the meantime, practitioners have turned to various heuristics in the
design of practical learning methods, in attempts to circumvent these tough computational problems.
One of the most common such heuristics is the use of a convex surrogate loss function
in place of the 0-1 loss in various optimizations performed by the learning method. The convexity
of the surrogate loss allows these optimizations to be performed efficiently, so that the
methods can be applied within a reasonable execution time, even with only modest computational
resources. Although classifiers arrived at in this way are not always guaranteed to be good
classifiers  when  performance  is  measured  under  the  0-1  loss,  in  practice  this  heuristic  has  often 
proven  quite  effective.  In  light  of  this  fact,  most  modern  learning  methods  either  explicitly  make 
use  of  a  surrogate  loss  in  the  formulation  of  optimization  problems  (e.g.,  SVM),  or  implicitly 
optimize  a  surrogate  loss  via  iterative  descent  (e.g.,  AdaBoost).  Indeed,  the  choice  of  a  surrogate 
loss  is  often  as  fundamental  a  part  of  the  process  of  approaching  a  learning  problem  as  the  choice 
of  hypothesis  class  or  learning  bias.  Thus  it  seems  essential  that  we  come  to  some  understanding 
of  how  best  to  make  use  of  surrogate  losses  in  the  design  of  learning  methods,  so  that  in  the 
favorable  scenario  that  this  heuristic  actually  does  work,  we  have  methods  taking  full  advantage 
of  it. 

In  this  work,  we  are  primarily  interested  in  how  best  to  use  surrogate  losses  in  the  context 
of active learning, which is a type of sequential design in which the learning algorithm is presented
with a large pool of unlabeled data points (i.e., only the covariates are observable), and
can sequentially request to observe the labels (response variables) of individual instances from
the pool. The objective in active learning is to produce a classifier of low error rate while accessing
a smaller number of labels than would be required for a method based on random labeled
data  points  (i.e.,  passive  learning)  to  achieve  the  same.  We  take  as  our  starting  point  that  we 
have  already  committed  to  use  a  given  surrogate  loss,  and  we  restrict  our  attention  to  just  those 
scenarios  in  which  this  heuristic  actually  does  work.  We  are  then  interested  in  how  best  to  make 
use  of  the  surrogate  loss  toward  the  goal  of  producing  a  classifier  with  relatively  small  error  rate. 
To  be  clear,  we  focus  on  the  case  where  the  minimizer  of  the  surrogate  risk  also  minimizes  the 
error  rate,  and  is  contained  in  our  function  class. 

We  construct  an  active  learning  strategy  based  on  optimizing  the  empirical  surrogate  risk  over 
increasingly  focused  subsets  of  the  instance  space,  and  derive  bounds  on  the  number  of  label 
requests  the  method  requires  to  achieve  a  given  error  rate.  Interestingly,  we  find  that  the  basic 
approach  of  optimizing  the  surrogate  risk  via  active  learning  to  a  sufficient  extent  to  guarantee 
small error rate generally does not lead to results that are as strong. In fact, the method our results
apply  to  typically  does  not  optimize  the  surrogate  risk  (even  in  the  limit).  The  insight  leading 
to  this  algorithm  is  that,  if  we  are  truly  only  interested  in  achieving  low  0-1  loss,  then  once  we 
have  identified  the  sign  of  the  optimal  function  at  a  given  point,  we  need  not  optimize  the  value 
of  the  function  at  that  point  any  further,  and  can  therefore  focus  the  label  requests  elsewhere.  As 
a  byproduct  of  this  analysis,  we  find  this  insight  has  implications  for  the  use  of  certain  surrogate 
losses  in  passive  learning  as  well,  though  to  a  lesser  extent. 

Most of the mathematical tools used in this analysis are inspired by recently developed techniques
for the study of active learning [Hanneke, 2009, 2011, Koltchinskii, 2010], in conjunction
with  the  results  of  Bartlett,  Jordan,  and  McAuliffe  [2006]  bounding  the  excess  error  rate  in  terms 
of  the  excess  surrogate  risk,  and  the  works  of  Koltchinskii  [2006]  and  Bartlett,  Bousquet,  and 
Mendelson  [2005]  on  localized  Rademacher  complexity  bounds. 
12.1.1 Related Work


There  are  many  previous  works  on  the  topic  of  surrogate  losses  in  the  context  of  passive  learning. 
Perhaps  the  most  relevant  to  our  results  below  are  the  work  of  Bartlett,  Jordan,  and  McAuliffe 
[2006]  and  the  related  work  of  Zhang  [2004].  These  develop  a  general  theory  for  converting 
results  on  excess  risk  under  the  surrogate  loss  into  results  on  excess  risk  under  the  0-1  loss. 
Below,  we  describe  the  conclusions  of  that  work  in  detail,  and  we  build  on  many  of  the  basic 
definitions  and  insights  pioneered  in  these  works. 

Another  related  line  of  research,  initiated  by  Audibert  and  Tsybakov  [2007],  studies  “plug-in 
rules,”  which  make  use  of  regression  estimates  obtained  by  optimizing  a  surrogate  loss,  and  are 
then rounded to $\{-1, +1\}$ values to obtain classifiers. They prove results under smoothness assumptions
on the actual regression function, which (remarkably) are often better than the known
results  for  methods  that  directly  optimize  the  0-1  loss.  Under  similar  conditions,  Minsker  [2012] 
studies  an  analogous  active  learning  method,  which  again  makes  use  of  a  surrogate  loss,  and 
obtains  improvements  in  label  complexity  compared  to  the  passive  learning  method  of  Audibert 
and  Tsybakov  [2007];  again,  the  results  for  this  method  based  on  a  surrogate  loss  are  actually 
better  than  those  derived  from  existing  active  learning  methods  designed  to  directly  optimize 
the  0-1  loss.  The  works  of  Audibert  and  Tsybakov  [2007]  and  Minsker  [2012]  raise  interesting 
questions  about  whether  the  general  analyses  of  methods  that  optimize  the  0-1  loss  remain  tight 
under  complexity  assumptions  on  the  regression  function,  and  potentially  also  about  the  design 
of  optimal  methods  for  classification  when  assumptions  are  phrased  in  terms  of  the  regression 
function. 

In  the  present  work,  we  focus  our  attention  on  scenarios  where  the  main  purpose  of  using  the 
surrogate  loss  is  to  ease  the  computational  problems  associated  with  minimizing  an  empirical 
risk,  so  that  our  statistical  results  are  typically  strongest  when  the  surrogate  loss  is  the  0-1  loss 
itself.  Thus,  in  the  specific  scenarios  studied  by  Minsker  [2012],  our  results  are  generally  not 
optimal;  rather,  the  main  strength  of  our  analysis  lies  in  its  generality.  In  this  sense,  our  results 
are more closely related to those of Bartlett, Jordan, and McAuliffe [2006] and Zhang [2004]
than to those of Audibert and Tsybakov [2007] and Minsker [2012]. That said, we note that
several important elements of the design and analysis of the active learning method below are
already present to some extent in the work of Minsker [2012].

There  are  several  interesting  works  on  active  learning  methods  that  optimize  a  general  loss 
function. Beygelzimer, Dasgupta, and Langford [2009] and Koltchinskii [2010] have both proposed
active learning methods, and analyzed the number of label requests the methods make
before  achieving  a  given  excess  risk  for  that  loss  function.  The  former  method  is  based  on 
importance  weighted  sampling,  while  the  latter  makes  clear  an  interesting  connection  to  local 
Rademacher  complexities.  One  natural  idea  for  approaching  the  problem  of  active  learning  with 
a  surrogate  loss  is  to  run  one  of  these  methods  with  the  surrogate  loss.  The  results  of  Bartlett, 
Jordan, and McAuliffe [2006] allow us to determine a sufficiently small value $\gamma$ such that any
function with excess surrogate risk at most $\gamma$ has excess error rate at most $\varepsilon$. Thus, by evaluating
the established bounds on the number of label requests sufficient for these active learning
methods to achieve excess surrogate risk $\gamma$, we immediately have a result on the number of label
requests sufficient for them to achieve excess error rate $\varepsilon$. This is a common strategy for constructing
and analyzing passive learning algorithms that make use of a surrogate loss. However,
as  we  discuss  below,  this  strategy  does  not  generally  lead  to  the  best  behavior  in  active  learning, 
and  often  will  not  be  much  better  than  simply  using  a  related  passive  learning  method.  Instead, 
we  propose  a  new  method  that  typically  does  not  optimize  the  surrogate  risk,  but  makes  use  of  it 
in  a  different  way  so  as  to  achieve  stronger  results  when  performance  is  measured  under  the  0-1 
loss. 


12.2  Definitions 

Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space, where $\mathcal{X}$ is called the instance space; for convenience, we
suppose this is a standard Borel space. Let $\mathcal{Y} = \{-1, +1\}$, and equip the space $\mathcal{X} \times \mathcal{Y}$ with its
product $\sigma$-algebra: $\mathcal{B} = \mathcal{B}_{\mathcal{X}} \otimes 2^{\mathcal{Y}}$. Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$, let $\mathcal{F}^*$ denote the set of all measurable
functions $g : \mathcal{X} \to \bar{\mathbb{R}}$, and let $\mathcal{F} \subseteq \mathcal{F}^*$, where $\mathcal{F}$ is called the function class. Throughout, we fix
a distribution $\mathcal{P}_{XY}$ over $\mathcal{X} \times \mathcal{Y}$, and we denote by $\mathcal{P}$ the marginal distribution of $\mathcal{P}_{XY}$ over $\mathcal{X}$. In
the analysis below, we make the usual simplifying assumption that the events and functions in the
definitions and proofs are indeed measurable. In most cases, this holds under simple conditions
on $\mathcal{F}$ and $\mathcal{P}_{XY}$ [see e.g., van der Vaart and Wellner, 2011]; when this is not the case, we may
turn to outer probabilities. However, we will not discuss these technical issues further.

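For any $h \in \mathcal{F}^*$, and any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, denote the error rate by $\mathrm{er}(h; P) =
P((x, y) : \mathrm{sign}(h(x)) \neq y)$; when $P = \mathcal{P}_{XY}$, we abbreviate this as $\mathrm{er}(h) = \mathrm{er}(h; \mathcal{P}_{XY})$. Also,
let $\eta(X; P)$ be a version of $P(Y = 1 | X)$, for $(X, Y) \sim P$; when $P = \mathcal{P}_{XY}$, abbreviate this as
$\eta(X) = \eta(X; \mathcal{P}_{XY})$. In particular, note that $\mathrm{er}(h; P)$ is minimized at any $h$ with $\mathrm{sign}(h(x)) =
\mathrm{sign}(\eta(x; P) - 1/2)$ for all $x \in \mathcal{X}$. In this work, we will also be interested in certain conditional
distributions and modifications of functions, specified as follows. For any measurable $\mathcal{U} \subseteq \mathcal{X}$
with $\mathcal{P}(\mathcal{U}) > 0$, define the probability measure $\mathcal{P}_{\mathcal{U}}(\cdot) = \mathcal{P}_{XY}(\cdot \,|\, \mathcal{U} \times \mathcal{Y}) = \mathcal{P}_{XY}(\cdot \cap \mathcal{U} \times \mathcal{Y}) / \mathcal{P}(\mathcal{U})$:
that is, $\mathcal{P}_{\mathcal{U}}$ is the conditional distribution of $(X, Y) \sim \mathcal{P}_{XY}$ given that $X \in \mathcal{U}$. Also, for any
$h, g \in \mathcal{F}^*$, define the spliced function $h_{\mathcal{U},g}(x) = h(x)\mathbb{1}_{\mathcal{U}}(x) + g(x)\mathbb{1}_{\mathcal{X} \setminus \mathcal{U}}(x)$. For a set $\mathcal{H} \subseteq \mathcal{F}^*$,
denote $\mathcal{H}_{\mathcal{U},g} = \{h_{\mathcal{U},g} : h \in \mathcal{H}\}$.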

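For any $\mathcal{H} \subseteq \mathcal{F}^*$, define the region of sign-disagreement $\mathrm{DIS}(\mathcal{H}) = \{x \in \mathcal{X} : \exists h, g \in
\mathcal{H} \text{ s.t. } \mathrm{sign}(h(x)) \neq \mathrm{sign}(g(x))\}$, and the region of value-disagreement $\mathrm{DISF}(\mathcal{H}) = \{x \in
\mathcal{X} : \exists h, g \in \mathcal{H} \text{ s.t. } h(x) \neq g(x)\}$, and denote by $\overline{\mathrm{DIS}}(\mathcal{H}) = \mathrm{DIS}(\mathcal{H}) \times \mathcal{Y}$ and $\overline{\mathrm{DISF}}(\mathcal{H}) =
\mathrm{DISF}(\mathcal{H}) \times \mathcal{Y}$. Additionally, we denote by $[\mathcal{H}] = \{f \in \mathcal{F}^* : \forall x \in \mathcal{X}, \inf_{h \in \mathcal{H}} h(x) \leq f(x) \leq
\sup_{h \in \mathcal{H}} h(x)\}$ the minimal bracket set containing $\mathcal{H}$.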
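
As a rough illustration of the two disagreement regions, the following sketch evaluates them for a finite class restricted to a finite pool of points. The restriction to finite sets, the helper names `dis` and `disf`, and the convention that the sign of zero is +1 are all assumptions made for this example only.

```python
from typing import Callable, List, Sequence

Function = Callable[[float], float]


def sign(t: float) -> int:
    # Assumed convention for this sketch: sign(t) = +1 for t >= 0, else -1.
    return 1 if t >= 0 else -1


def dis(H: Sequence[Function], pool: Sequence[float]) -> List[float]:
    """Pool points where at least two functions in H disagree in sign."""
    return [x for x in pool if len({sign(h(x)) for h in H}) > 1]


def disf(H: Sequence[Function], pool: Sequence[float]) -> List[float]:
    """Pool points where at least two functions in H disagree in value."""
    return [x for x in pool if len({h(x) for h in H}) > 1]


# Toy class: x -> x - t for two thresholds t (hypothetical example).
H = [lambda x, t=t: x - t for t in (0.2, 0.4)]
pool = [0.1, 0.3, 0.5]

sign_region = dis(H, pool)    # only the point between the two thresholds
value_region = disf(H, pool)  # every point, since the two functions never coincide
```

The example shows why the value-disagreement region can be much larger than the sign-disagreement region: the two functions differ in value everywhere but differ in sign only between the thresholds.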

Our interest here is learning from data, so let $\mathcal{Z} = \{(X_1, Y_1), (X_2, Y_2), \ldots\}$ denote a sequence
of independent $\mathcal{P}_{XY}$-distributed random variables, referred to as the labeled data sequence, while
$\{X_1, X_2, \ldots\}$ is referred to as the unlabeled data sequence. For $m \in \mathbb{N}$, we also denote $\mathcal{Z}_m =
\{(X_1, Y_1), \ldots, (X_m, Y_m)\}$. Throughout, we will let $\delta \in (0, 1/4)$ denote an arbitrary confidence
parameter, which will be referenced in the methods and theorem statements.

The active learning protocol is defined as follows. An active learning algorithm is initially
permitted access to the sequence $X_1, X_2, \ldots$ of unlabeled data. It may then select an index $i_1 \in \mathbb{N}$
and request to observe $Y_{i_1}$; after observing $Y_{i_1}$, it may select another index $i_2 \in \mathbb{N}$, request to
observe $Y_{i_2}$, and so on. After a number of such label requests not exceeding some specified budget
$n$, the algorithm halts and returns a function $\hat{h} \in \mathcal{F}^*$. Formally, this protocol specifies a type
of mapping that maps the random variable $\mathcal{Z}$ to a function $\hat{h}$, where $\hat{h}$ is conditionally independent
of $\mathcal{Z}$ given $X_1, X_2, \ldots$ and $(i_1, Y_{i_1}), (i_2, Y_{i_2}), \ldots, (i_n, Y_{i_n})$, where each $i_k$ is conditionally
independent of $\mathcal{Z}$ and $i_{k+1}$, given $X_1, X_2, \ldots$ and $(i_1, Y_{i_1}), \ldots, (i_{k-1}, Y_{i_{k-1}})$.
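
The protocol above can be sketched in code as follows. This is only an illustration under simplifying assumptions: the pool is finite rather than an infinite sequence, and `LabelOracle`, `run_active_learner`, `binary_search_chooser`, and `fit_threshold` are hypothetical names introduced here, not part of the method analyzed in this chapter. The point of the sketch is the information flow: each requested index depends only on the unlabeled pool and on labels already observed, and the number of requests never exceeds the budget.

```python
from typing import Dict, List, Sequence


class LabelOracle:
    """Hides the labels and enforces the label budget n."""

    def __init__(self, labels: Sequence[int], budget: int) -> None:
        self._labels = labels
        self.budget = budget
        self.requests: List[int] = []  # requested indices, in order

    def request(self, i: int) -> int:
        if len(self.requests) >= self.budget:
            raise RuntimeError("label budget exhausted")
        self.requests.append(i)
        return self._labels[i]


def run_active_learner(xs, oracle, choose_index, fit):
    """Query labels one at a time, then return the fitted function."""
    observed: Dict[int, int] = {}
    while len(oracle.requests) < oracle.budget:
        i = choose_index(xs, observed)  # sees only xs and past labels
        if i is None:                   # learner may stop early
            break
        observed[i] = oracle.request(i)
    return fit(xs, observed)


def binary_search_chooser(xs, observed):
    # Hypothetical query rule for 1-D thresholds on a sorted pool.
    lo = max((i for i, y in observed.items() if y == -1), default=-1)
    hi = min((i for i, y in observed.items() if y == +1), default=len(xs))
    return None if hi - lo <= 1 else (lo + hi) // 2


def fit_threshold(xs, observed):
    hi = min((i for i, y in observed.items() if y == +1), default=len(xs))
    t = xs[hi] if hi < len(xs) else float("inf")
    return lambda x: x - t


xs = [i / 100 for i in range(100)]
labels = [+1 if i >= 55 else -1 for i in range(100)]
oracle = LabelOracle(labels, budget=10)
h_hat = run_active_learner(xs, oracle, binary_search_chooser, fit_threshold)
```

In this noise-free toy problem the learner recovers the labeling of all 100 pool points from a handful of label requests, whereas a passive learner would need labels near the threshold to appear by chance.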

12.2.1  Surrogate  Loss  Functions  for  Classification 

Throughout, we let $\ell : \bar{\mathbb{R}} \to [0, \infty]$ denote an arbitrary surrogate loss function; we will primarily
be interested in functions $\ell$ that satisfy certain conditions discussed below. To simplify some
statements below, it will be convenient to suppose $\ell(z) < \infty$. For any $g \in \mathcal{F}^*$ and distribution
$P$ over $\mathcal{X} \times \mathcal{Y}$, let $R_\ell(g; P) = \mathbb{E}[\ell(g(X)Y)]$, where $(X, Y) \sim P$; in the case $P = \mathcal{P}_{XY}$,
abbreviate $R_\ell(g) = R_\ell(g; \mathcal{P}_{XY})$. Also define $\bar{\ell} = 1 \vee \sup_{x \in \mathcal{X}} \sup_{h \in \mathcal{F}} \max_{y \in \{-1,+1\}} \ell(y h(x))$; we
will generally suppose $\bar{\ell} < \infty$. In practice, this is more often a constraint on $\mathcal{F}$ than on $\ell$; that is,
we could have $\ell$ unbounded, but due to some normalization of the functions $h \in \mathcal{F}$, $\ell$ is bounded
on the corresponding set of values.

Throughout this work, we will be interested in loss functions $\ell$ whose point-wise minimizer
necessarily also optimizes the 0-1 loss. This property was nicely characterized by Bartlett, Jordan,
and McAuliffe [2006] as follows. For $\eta_0 \in [0, 1]$, define $\ell^*(\eta_0) = \inf_{z \in \bar{\mathbb{R}}} (\eta_0 \ell(z) + (1 -
\eta_0)\ell(-z))$, and $\ell^*_-(\eta_0) = \inf_{z \in \bar{\mathbb{R}} : z(2\eta_0 - 1) \leq 0} (\eta_0 \ell(z) + (1 - \eta_0)\ell(-z))$.

Definition 12.1. The loss $\ell$ is classification-calibrated if, $\forall \eta_0 \in [0, 1] \setminus \{1/2\}$, $\ell^*_-(\eta_0) > \ell^*(\eta_0)$.

In our context, for $X \sim \mathcal{P}$, $\ell^*(\eta(X))$ represents the minimum value of the conditional $\ell$-risk
at $X$, so that $\mathbb{E}[\ell^*(\eta(X))] = \inf_{h \in \mathcal{F}^*} R_\ell(h)$, while $\ell^*_-(\eta(X))$ represents the minimum conditional
$\ell$-risk at $X$, subject to having a sub-optimal conditional error rate at $X$: i.e., $\mathrm{sign}(h(X)) \neq
\mathrm{sign}(\eta(X) - 1/2)$. Thus, being classification-calibrated implies the minimizer of the conditional
$\ell$-risk at $X$ necessarily has the same sign as the minimizer of the conditional error rate at $X$.
Since we are only interested here in using $\ell$ as a reasonable surrogate for the 0-1 loss, throughout
the work below we suppose $\ell$ is classification-calibrated.
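
To make the definition concrete, the following sketch approximates the unconstrained minimum conditional risk and the minimum conditional risk over predictions with the wrong sign (or zero), then compares them at a few values of the conditional probability. It is a numerical illustration only: the infimum over the extended reals is replaced by a minimum over an assumed finite grid on [-5, 5], so it cannot prove calibration. The function names and the deliberately mis-oriented loss `reversed_exp` are introduced here for illustration.

```python
import math

# Assumed finite stand-in for the extended real line.
GRID = [k / 200.0 for k in range(-1000, 1001)]


def conditional_risk(loss, eta0, z):
    """eta0 * loss(z) + (1 - eta0) * loss(-z) for prediction value z."""
    return eta0 * loss(z) + (1.0 - eta0) * loss(-z)


def ell_star(loss, eta0):
    """Grid approximation of the minimum conditional risk."""
    return min(conditional_risk(loss, eta0, z) for z in GRID)


def ell_star_minus(loss, eta0):
    """Same minimum, restricted to z with z * (2 * eta0 - 1) <= 0."""
    return min(
        conditional_risk(loss, eta0, z)
        for z in GRID
        if z * (2 * eta0 - 1) <= 0
    )


def looks_calibrated(loss, etas=(0.1, 0.3, 0.7, 0.9), tol=1e-9):
    """True if the strict gap in Definition 12.1 holds at the tested values."""
    return all(ell_star_minus(loss, e) > ell_star(loss, e) + tol for e in etas)


exp_loss = lambda z: math.exp(-z)     # exponential loss
reversed_exp = lambda z: math.exp(z)  # rewards the wrong sign; not calibrated

exp_ok = looks_calibrated(exp_loss)
reversed_ok = looks_calibrated(reversed_exp)
```

For the exponential loss the constrained minimum is strictly larger at every tested value, while for the reversed loss the unconstrained minimizer already has the wrong sign, so the two minima coincide and the check fails.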

Though not strictly necessary for our results below, it will be convenient for us to suppose
that, for all $\eta_0 \in [0, 1]$, this infimum value $\ell^*(\eta_0)$ is actually obtained as $\eta_0 \ell(z^*(\eta_0)) + (1 -
\eta_0)\ell(-z^*(\eta_0))$ for some $z^*(\eta_0) \in \bar{\mathbb{R}}$ (not necessarily unique). For instance, this is the case
for any nonincreasing right-continuous $\ell$, or continuous and convex $\ell$, which include most of
the cases we are interested in using as surrogate losses anyway. The proofs can be modified in a
natural way to handle the general case, simply substituting any $z$ with conditional risk sufficiently
close to the minimum value. For any distribution $P$, denote $h^*_P(x) = z^*(\eta(x; P))$ for all $x \in
\mathcal{X}$. In particular, note that $h^*_P$ obtains $R_\ell(h^*_P; P) = \inf_{g \in \mathcal{F}^*} R_\ell(g; P)$. When $P = \mathcal{P}_{XY}$, we
abbreviate this as $h^* = h^*_{\mathcal{P}_{XY}}$. Furthermore, if $\ell$ is classification-calibrated, then $\mathrm{sign}(h^*_P(x)) =
\mathrm{sign}(\eta(x; P) - 1/2)$ for all $x \in \mathcal{X}$ with $\eta(x; P) \neq 1/2$, and hence $\mathrm{er}(h^*_P; P) = \inf_{h \in \mathcal{F}^*} \mathrm{er}(h; P)$
as well.

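For any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, and any $h, g \in \mathcal{F}^*$, define the loss distance $\mathrm{D}_\ell(h, g; P) =
\sqrt{\mathbb{E}\left[(\ell(h(X)Y) - \ell(g(X)Y))^2\right]}$, where $(X, Y) \sim P$. Also define the loss diameter of a class
$\mathcal{H} \subseteq \mathcal{F}^*$ as $\mathrm{D}_\ell(\mathcal{H}; P) = \sup_{h,g \in \mathcal{H}} \mathrm{D}_\ell(h, g; P)$, and the $\ell$-risk $\varepsilon$-minimal set of $\mathcal{H}$ as $\mathcal{H}(\varepsilon; \ell, P) =
\{h \in \mathcal{H} : R_\ell(h; P) - \inf_{g \in \mathcal{H}} R_\ell(g; P) \leq \varepsilon\}$. When $P = \mathcal{P}_{XY}$, we abbreviate these as $\mathrm{D}_\ell(h, g) =
\mathrm{D}_\ell(h, g; \mathcal{P}_{XY})$, $\mathrm{D}_\ell(\mathcal{H}) = \mathrm{D}_\ell(\mathcal{H}; \mathcal{P}_{XY})$, and $\mathcal{H}(\varepsilon; \ell) = \mathcal{H}(\varepsilon; \ell, \mathcal{P}_{XY})$. Also, for any $h \in \mathcal{F}^*$,
abbreviate $h_{\mathcal{U}} = h_{\mathcal{U},h^*}$, and for any $\mathcal{H} \subseteq \mathcal{F}^*$, define $\mathcal{H}_{\mathcal{U}} = \{h_{\mathcal{U}} : h \in \mathcal{H}\}$.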

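We additionally define related quantities for the 0-1 loss, as follows. Define the distance
$\Delta_P(h, g) = \mathcal{P}(x : \mathrm{sign}(h(x)) \neq \mathrm{sign}(g(x)))$ and radius $\mathrm{radius}(\mathcal{H}; P) = \sup_{h \in \mathcal{H}} \Delta_P(h, h^*_P)$.
Also define the $\varepsilon$-minimal set of $\mathcal{H}$ as $\mathcal{H}(\varepsilon; {}_{01}, P) = \{h \in \mathcal{H} : \mathrm{er}(h; P) - \inf_{g \in \mathcal{H}} \mathrm{er}(g; P) \leq \varepsilon\}$,
and for $r > 0$, define the $r$-ball centered at $h$ in $\mathcal{H}$ by $\mathrm{B}_{\mathcal{H},P}(h, r) = \{g \in \mathcal{H} : \Delta_P(h, g) \leq r\}$.
When $P = \mathcal{P}_{XY}$, we abbreviate these as $\Delta(h, g) = \Delta_{\mathcal{P}_{XY}}(h, g)$, $\mathrm{radius}(\mathcal{H}) = \mathrm{radius}(\mathcal{H}; \mathcal{P}_{XY})$,
$\mathcal{H}(\varepsilon; {}_{01}) = \mathcal{H}(\varepsilon; {}_{01}, \mathcal{P}_{XY})$, and $\mathrm{B}_{\mathcal{H}}(h, r) = \mathrm{B}_{\mathcal{H},\mathcal{P}_{XY}}(h, r)$; when $\mathcal{H} = \mathcal{F}$, further abbreviate
$\mathrm{B}(h, r) = \mathrm{B}_{\mathcal{F}}(h, r)$.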

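We will be interested in transforming results concerning the excess surrogate risk into results
on the excess error rate. As such, we will make use of the following abstract transformation.
Definition 12.2. For any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, and any $\varepsilon \in [0, 1]$, define

$$\Gamma_\ell(\varepsilon; P) = \sup\{\gamma > 0 : \mathcal{F}^*(\gamma; \ell, P) \subseteq \mathcal{F}^*(\varepsilon; {}_{01}, P)\} \cup \{0\}.$$

Also, for any $\gamma \in [0, \infty)$, define the inverse

$$\mathcal{E}_\ell(\gamma; P) = \inf\{\varepsilon > 0 : \gamma \leq \Gamma_\ell(\varepsilon; P)\}.$$

When $P = \mathcal{P}_{XY}$, abbreviate $\Gamma_\ell(\varepsilon) = \Gamma_\ell(\varepsilon; \mathcal{P}_{XY})$ and $\mathcal{E}_\ell(\gamma) = \mathcal{E}_\ell(\gamma; \mathcal{P}_{XY})$.

By definition, $\Gamma_\ell$ has the property that

$$\forall h \in \mathcal{F}^*, \forall \varepsilon \in [0, 1], \quad R_\ell(h) - R_\ell(h^*) < \Gamma_\ell(\varepsilon) \implies \mathrm{er}(h) - \mathrm{er}(h^*) \leq \varepsilon. \tag{12.1}$$

In fact, $\Gamma_\ell$ is defined to be maximal with this property, in that any $\Gamma'_\ell$ for which (12.1) is satisfied
must have $\Gamma'_\ell(\varepsilon) \leq \Gamma_\ell(\varepsilon)$ for all $\varepsilon \in [0, 1]$.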

In our context, we will typically be interested in calculating lower bounds on $\Gamma_\ell$ for any
particular scenario of interest. Bartlett, Jordan, and McAuliffe [2006] studied various lower
bounds of this type. Specifically, for $\zeta \in [-1, 1]$, define $\tilde{\psi}_\ell(\zeta) = \ell^*_-\left(\frac{1+\zeta}{2}\right) - \ell^*\left(\frac{1+\zeta}{2}\right)$, and
let $\psi_\ell$ be the largest convex lower bound of $\tilde{\psi}_\ell$ on $[0, 1]$, which is well-defined in this context
[Bartlett, Jordan, and McAuliffe, 2006]; for convenience, also define $\psi_\ell(x)$ for $x \in (1, \infty)$
arbitrarily subject to maintaining convexity of $\psi_\ell$. Bartlett, Jordan, and McAuliffe [2006] show
$\psi_\ell$ is continuous and nondecreasing on $(0, 1)$, and in fact that $x \mapsto \psi_\ell(x)/x$ is nondecreasing on
$(0, \infty)$. They also show every $h \in \mathcal{F}^*$ has $\psi_\ell(\mathrm{er}(h) - \mathrm{er}(h^*)) \leq R_\ell(h) - R_\ell(h^*)$, so that $\psi_\ell \leq \Gamma_\ell$,
and they find this inequality can be tight for a particular choice of $\mathcal{P}_{XY}$. They further study more
subtle relationships between excess $\ell$-risk and excess error rate holding for any classification-calibrated
$\ell$. In particular, following the same argument as in the proof of their Theorem 3, one
can show that if $\ell$ is classification-calibrated, every $h \in \mathcal{F}^*$ satisfies

$$\Delta(h, h^*) \cdot \psi_\ell\!\left(\frac{\mathrm{er}(h) - \mathrm{er}(h^*)}{2\Delta(h, h^*)}\right) \leq R_\ell(h) - R_\ell(h^*).$$

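The implication of this in our context is the following. Fix any nondecreasing function $\Psi_\ell :
[0, 1] \to [0, \infty)$ such that $\forall \varepsilon \geq 0$,

$$\Psi_\ell(\varepsilon) \leq \mathrm{radius}(\mathcal{F}^*(\varepsilon; {}_{01}))\, \psi_\ell\!\left(\frac{\varepsilon}{2\,\mathrm{radius}(\mathcal{F}^*(\varepsilon; {}_{01}))}\right). \tag{12.2}$$

Any $h \in \mathcal{F}^*$ with $R_\ell(h) - R_\ell(h^*) < \Psi_\ell(\varepsilon)$ also has $\Delta(h, h^*)\, \psi_\ell\!\left(\frac{\mathrm{er}(h) - \mathrm{er}(h^*)}{2\Delta(h, h^*)}\right) < \Psi_\ell(\varepsilon)$; combined
with the fact that $x \mapsto \psi_\ell(x)/x$ is nondecreasing on $(0, 1)$, this implies

$$\mathrm{radius}(\mathcal{F}^*(\mathrm{er}(h) - \mathrm{er}(h^*); {}_{01}))\, \psi_\ell\!\left(\frac{\mathrm{er}(h) - \mathrm{er}(h^*)}{2\,\mathrm{radius}(\mathcal{F}^*(\mathrm{er}(h) - \mathrm{er}(h^*); {}_{01}))}\right) < \Psi_\ell(\varepsilon);$$

this means $\Psi_\ell(\mathrm{er}(h) - \mathrm{er}(h^*)) < \Psi_\ell(\varepsilon)$, and
monotonicity of $\Psi_\ell$ implies $\mathrm{er}(h) - \mathrm{er}(h^*) < \varepsilon$. Altogether, this implies $\Psi_\ell(\varepsilon) \leq \Gamma_\ell(\varepsilon)$. In fact,
though we do not present the details here, with only minor modifications to the proofs below,
when $h^* \in \mathcal{F}$, all of our results involving $\Gamma_\ell(\varepsilon)$ will also hold while replacing $\Gamma_\ell(\varepsilon)$ with any
nondecreasing $\Psi'_\ell$ such that $\forall \varepsilon \geq 0$,

$$\Psi'_\ell(\varepsilon) \leq \mathrm{radius}(\mathcal{F}(\varepsilon; {}_{01}))\, \psi_\ell\!\left(\frac{\varepsilon}{2\,\mathrm{radius}(\mathcal{F}(\varepsilon; {}_{01}))}\right), \tag{12.3}$$

which can sometimes lead to tighter results.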

Some  of  our  stronger  results  below  will  be  stated  for  a  restricted  family  of  losses,  originally 
explored  by  Bartlett,  Jordan,  and  McAuliffe  [2006]:  namely,  smooth  losses  whose  convexity 
is  quantified  by  a  polynomial.  Specifically,  this  restriction  is  characterized  by  the  following 
condition. 

Condition 12.3. $\mathcal{F}$ is convex, with $\forall x \in \mathcal{X}$, $\sup_{f \in \mathcal{F}} |f(x)| \leq B$ for some constant $B \in (0, \infty)$,
and there exists a pseudometric $d_\ell : [-B, B]^2 \to [0, \bar{d}_\ell]$ for some constant $\bar{d}_\ell \in (0, \infty)$, and constants
$L, C_\ell \in (0, \infty)$ and $r_\ell \in (0, \infty]$ such that $\forall x, y \in [-B, B]$, $|\ell(x) - \ell(y)| \leq L d_\ell(x, y)$ and
the function $\bar{\delta}_\ell(\varepsilon) = \inf\left\{\frac{1}{2}\ell(x) + \frac{1}{2}\ell(y) - \ell\left(\frac{1}{2}x + \frac{1}{2}y\right) : x, y \in [-B, B], d_\ell(x, y) \geq \varepsilon\right\} \cup \{\infty\}$
satisfies $\forall \varepsilon \in [0, \infty)$, $\bar{\delta}_\ell(\varepsilon) \geq C_\ell \varepsilon^{r_\ell}$.

In particular, note that if $\mathcal{F}$ is convex, the functions in $\mathcal{F}$ are uniformly bounded, and $\ell$ is
convex and continuous, Condition 12.3 is always satisfied (though possibly with $r_\ell = \infty$) by
taking $d_\ell(x, y) = |x - y|/(4B)$.

12.2.2  A  Few  Examples  of  Loss  Functions 

Here we briefly mention a few loss functions $\ell$ in common practical use, all of which are
classification-calibrated. These examples are taken directly from the work of Bartlett, Jordan,
and McAuliffe [2006], which additionally discusses many other interesting examples of
classification-calibrated loss functions and their corresponding $\psi_\ell$ functions.

Example 1 The exponential loss is specified as $\ell(x) = e^{-x}$. This loss function appears in
many contexts in machine learning; for instance, the popular AdaBoost method can be viewed as
an algorithm that greedily optimizes the exponential loss [Freund and Schapire, 1997]. Bartlett,
Jordan, and McAuliffe [2006] show that under the exponential loss, $\psi_\ell(x) = 1 - \sqrt{1 - x^2}$, which
is tightly approximated by $x^2/2$ for small $x$. They also show this loss satisfies the conditions on
$\ell$ in Condition 12.3 with $d_\ell(x, y) = |x - y|$, $L = e^{B}$, $C_\ell = e^{-B}/8$, and $r_\ell = 2$.

Example 2 The hinge loss, specified as $\ell(x) = \max\{1 - x, 0\}$, is another common surrogate
loss in machine learning practice today. For instance, it is used in the objective of the Support
Vector Machine (along with a regularization term) [Cortes and Vapnik, 1995]. Bartlett, Jordan,
and McAuliffe [2006] show that for the hinge loss, $\psi_\ell(x) = |x|$. The hinge loss is Lipschitz continuous,
with Lipschitz constant 1. However, for the remaining conditions on $\ell$ in Condition 12.3,
any $x, y \leq 1$ have $\frac{1}{2}\ell(x) + \frac{1}{2}\ell(y) = \ell\left(\frac{1}{2}x + \frac{1}{2}y\right)$, so that $\bar{\delta}_\ell(\varepsilon) = 0$; hence, $r_\ell = \infty$ is required.

Example 3 The quadratic loss (or squared loss), specified as $\ell(x) = (1 - x)^2$, is often used
in so-called plug-in classifiers [Audibert and Tsybakov, 2007], which approach the problem of
learning a classifier by estimating the regression function $\mathbb{E}[Y | X = x] = 2\eta(x) - 1$, and then
taking the sign of this estimator to get a binary classifier. The quadratic loss has the convenient
property that for any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, $h^*_P(\cdot) = 2\eta(\cdot; P) - 1$, so that it is straightforward
to describe the set of distributions $P$ satisfying the assumption $h^*_P \in \mathcal{F}$. Bartlett, Jordan, and
McAuliffe [2006] show that for the quadratic loss, $\psi_\ell(x) = x^2$. They also show the quadratic
loss satisfies the conditions on $\ell$ in Condition 12.3, with $L = 2(B + 1)$, $C_\ell = 1/4$, and $r_\ell = 2$.
In fact, they study the general family of losses $\ell(x) = |1 - x|^p$, for $p \in (1, \infty)$, and show that
$\psi_\ell(x)$ and $r_\ell$ exhibit a range of behaviors varying with $p$.
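
The constants quoted for the quadratic loss can be checked by a short calculation. The following is a sketch under the assumption (not stated explicitly above for this example) that the pseudometric is $d_\ell(x, y) = |x - y|$, with $x, y \in [-B, B]$.

```latex
% Midpoint convexity gap of \ell(x) = (1 - x)^2, writing a = 1 - x and b = 1 - y:
\tfrac{1}{2}\ell(x) + \tfrac{1}{2}\ell(y) - \ell\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right)
  = \tfrac{1}{2}a^2 + \tfrac{1}{2}b^2 - \left(\tfrac{a + b}{2}\right)^2
  = \tfrac{1}{4}(a - b)^2
  = \tfrac{1}{4}(x - y)^2 .
% Hence, whenever |x - y| \geq \varepsilon, the gap is at least \varepsilon^2 / 4,
% which is consistent with C_\ell = 1/4 and r_\ell = 2.

% Lipschitz-type bound, using |x|, |y| \leq B:
|\ell(x) - \ell(y)| = |a^2 - b^2| = |x - y| \cdot |2 - x - y| \leq 2(B + 1)\,|x - y| ,
% which is consistent with L = 2(B + 1).
```

The same identity explains why the hinge loss needs $r_\ell = \infty$: on the linear part of the hinge loss the midpoint gap is exactly zero, so no polynomial lower bound of this form is available.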


Example 4 The truncated quadratic loss is specified as $\ell(x) = (\max\{1 - x, 0\})^2$. Bartlett,
Jordan, and McAuliffe [2006] show that in this case, $\psi_\ell(x) = x^2$. They also show that, under
the pseudometric $d_\ell(a, b) = |\min\{a, 1\} - \min\{b, 1\}|$, the truncated quadratic loss satisfies the
conditions on $\ell$ in Condition 12.3, with $L = 2(B + 1)$, $C_\ell = 1/4$, and $r_\ell = 2$.
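
The closed forms quoted in Examples 1 through 4 can be sanity-checked numerically. The sketch below uses the transform of Bartlett, Jordan, and McAuliffe [2006], namely the gap between the minimum conditional risk over wrong-sign predictions and the unconstrained minimum, evaluated at conditional probability (1 + zeta)/2. The infimum is replaced by a minimum over an assumed finite grid on [-5, 5], so agreement holds only up to a small tolerance, and the dictionary and function names are introduced here for illustration.

```python
import math

LOSSES = {
    "exponential": lambda z: math.exp(-z),
    "hinge": lambda z: max(1.0 - z, 0.0),
    "quadratic": lambda z: (1.0 - z) ** 2,
    "truncated_quadratic": lambda z: max(1.0 - z, 0.0) ** 2,
}

# Closed forms quoted in the examples above.
PSI = {
    "exponential": lambda x: 1.0 - math.sqrt(1.0 - x * x),
    "hinge": lambda x: abs(x),
    "quadratic": lambda x: x * x,
    "truncated_quadratic": lambda x: x * x,
}

# Assumed finite stand-in for the extended reals (step 0.001).
GRID = [k / 1000.0 for k in range(-5000, 5001)]


def psi_tilde(loss, zeta):
    """Grid approximation of the calibration gap at eta0 = (1 + zeta) / 2."""
    eta0 = (1.0 + zeta) / 2.0

    def risk(z):
        return eta0 * loss(z) + (1.0 - eta0) * loss(-z)

    best = min(risk(z) for z in GRID)
    best_wrong_sign = min(risk(z) for z in GRID if z * (2 * eta0 - 1) <= 0)
    return best_wrong_sign - best


# Largest deviation from the quoted closed form, per loss.
max_gap = {
    name: max(abs(psi_tilde(loss, z) - PSI[name](z)) for z in (0.0, 0.2, 0.5, 0.8))
    for name, loss in LOSSES.items()
}
```

For these four losses the computed gap is already convex in zeta on the tested points, so it is compared directly with the quoted closed forms rather than with a separately computed convex lower bound.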


12.2.3 Empirical $\ell$-Risk Minimization


For any $m \in \mathbb{N}$, $g : \mathcal{X} \to \bar{\mathbb{R}}$, and $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$, define the empirical
$\ell$-risk as $R_\ell(g; S) = m^{-1} \sum_{i=1}^{m} \ell(g(x_i) y_i)$. At times it will be convenient to keep track of the
indices for a subsequence of $\mathcal{Z}$, and for this reason we also overload the notation, so that for
any $Q = \{(i_1, y_1), \ldots, (i_m, y_m)\} \in (\mathbb{N} \times \mathcal{Y})^m$, we define $S[Q] = \{(X_{i_1}, y_1), \ldots, (X_{i_m}, y_m)\}$
and $R_\ell(g; Q) = R_\ell(g; S[Q])$. For completeness, we also generally define $R_\ell(g; \emptyset) = 0$. The
method of empirical $\ell$-risk minimization, here denoted by $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$, is characterized by
the property that it returns $\hat{h} = \operatorname{argmin}_{h \in \mathcal{H}} R_\ell(h; \mathcal{Z}_m)$. This is a well-studied and classical
passive learning method, presently in popular use in applications, and as such it will serve as our
baseline for passive learning methods.
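
For a finite class, empirical surrogate-risk minimization reduces to an exhaustive search. The sketch below is a minimal illustration under that assumption; the names `empirical_risk` and `erm`, the toy class of linear functions, and the four-point sample are all hypothetical and chosen only to make the computation easy to follow by hand.

```python
from typing import Callable, Sequence, Tuple

Function = Callable[[float], float]
Sample = Sequence[Tuple[float, int]]


def empirical_risk(loss: Function, g: Function, S: Sample) -> float:
    """Average of loss(g(x) * y) over S; defined as 0 on the empty sample."""
    if not S:
        return 0.0
    return sum(loss(g(x) * y) for x, y in S) / len(S)


def erm(loss: Function, H: Sequence[Function], S: Sample) -> Function:
    """Return a function in the finite class H minimizing the empirical risk."""
    return min(H, key=lambda h: empirical_risk(loss, h, S))


hinge = lambda z: max(1.0 - z, 0.0)

# Toy class: x -> w * x for a few slopes w (hypothetical example).
H = [lambda x, w=w: w * x for w in (-1.0, 0.0, 0.5, 1.0, 2.0)]
S = [(-1.0, -1), (-0.5, -1), (0.5, +1), (1.0, +1)]

h_hat = erm(hinge, H, S)
risks = [empirical_risk(hinge, h, S) for h in H]
```

Here the empirical hinge risks of the five slopes are 1.75, 1, 0.625, 0.25, and 0, so the search returns the slope-2 function, which attains margin at least 1 on every sample point.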
12.2.4 Localized Sample Complexities


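The derivation of localized excess risk bounds can essentially be motivated as follows. Suppose
we are interested in bounding the excess $\ell$-risk of $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$. Further suppose we have a
coarse guarantee $U_\ell(\mathcal{H}, m)$ on the excess $\ell$-risk of the $\hat{h}$ returned by $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$: that is,
$R_\ell(\hat{h}) - R_\ell(h^*) \leq U_\ell(\mathcal{H}, m)$. In some sense, this guarantee identifies a set $\mathcal{H}' \subseteq \mathcal{H}$ of functions
that a priori have the potential to be returned by $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ (namely, $\mathcal{H}' = \mathcal{H}(U_\ell(\mathcal{H}, m); \ell)$),
while those in $\mathcal{H} \setminus \mathcal{H}'$ do not. With this information in hand, we can think of $\mathcal{H}'$ as a kind of
effective function class, and we can then think of $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ as equivalent to $\mathrm{ERM}_\ell(\mathcal{H}', \mathcal{Z}_m)$.
We may then repeat this same reasoning for $\mathrm{ERM}_\ell(\mathcal{H}', \mathcal{Z}_m)$, calculating $U_\ell(\mathcal{H}', m)$ to determine
a set $\mathcal{H}'' = \mathcal{H}'(U_\ell(\mathcal{H}', m); \ell) \subseteq \mathcal{H}'$ of potential return values for this empirical minimizer, so
that $\mathrm{ERM}_\ell(\mathcal{H}', \mathcal{Z}_m) = \mathrm{ERM}_\ell(\mathcal{H}'', \mathcal{Z}_m)$, and so on. This repeats until we identify a fixed-point
set $\mathcal{H}^{(\infty)}$ of functions such that $\mathcal{H}^{(\infty)}(U_\ell(\mathcal{H}^{(\infty)}, m); \ell) = \mathcal{H}^{(\infty)}$, so that no further reduction is
possible. Following this chain of reasoning back to the beginning, we find that $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m) =
\mathrm{ERM}_\ell(\mathcal{H}^{(\infty)}, \mathcal{Z}_m)$, so that the function $\hat{h}$ returned by $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ has excess $\ell$-risk at most
$U_\ell(\mathcal{H}^{(\infty)}, m)$, which may be significantly smaller than $U_\ell(\mathcal{H}, m)$, depending on how refined the
original $U_\ell(\mathcal{H}, m)$ bound was.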
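
The fixed-point reasoning above can be imitated on a toy problem. In the sketch below each function is represented only by its excess risk, and `toy_bound` is a made-up coarse guarantee (a class-dependent "diameter" term times the square root of s/m, plus s/m) chosen so that shrinking the class shrinks the bound. It is not the bound of Koltchinskii [2006]; it serves only to show how repeated localization reaches a fixed point.

```python
import math
from typing import List, Tuple


def toy_bound(excess: List[float], m: int, s: float = 1.0) -> float:
    """Hypothetical coarse guarantee that depends on the current class."""
    diameter = math.sqrt(max(excess))  # assumed proxy for the loss diameter
    return diameter * math.sqrt(s / m) + s / m


def localize(excess: List[float], m: int, s: float = 1.0) -> Tuple[List[float], float]:
    """Shrink the class to its bound-minimal set until nothing changes."""
    current = sorted(excess)
    while True:
        bound = toy_bound(current, m, s)
        smaller = [e for e in current if e <= bound]
        if smaller == current:  # fixed point reached
            return current, bound
        current = smaller


# Excess risks of six hypothetical functions; the best one has excess risk 0.
all_excess = [0.0, 0.01, 0.05, 0.2, 0.5, 1.0]
initial_bound = toy_bound(all_excess, m=100)
fixed_point, final_bound = localize(all_excess, m=100)
```

With 100 samples the initial bound is 0.11, the first pass keeps three functions, the second keeps two, and the third changes nothing, leaving a final bound of about 0.02.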

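To formalize this fixed-point argument for $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$, Koltchinskii [2006] makes use of
the following quantities to define the coarse bound $U_\ell(\mathcal{H}, m)$ [see also Bartlett, Bousquet, and
Mendelson, 2005, Giné and Koltchinskii, 2006]. For any $\mathcal{H} \subseteq [\mathcal{F}]$, $m \in \mathbb{N}$, $s \in [1, \infty)$, and any
distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, letting $Q \sim P^m$, define

$$\phi_\ell(\mathcal{H}; m, P) = \mathbb{E}\left[\sup_{h,g \in \mathcal{H}} \left(R_\ell(h; P) - R_\ell(g; P)\right) - \left(R_\ell(h; Q) - R_\ell(g; Q)\right)\right],$$

$$\bar{U}_\ell(\mathcal{H}; P, m, s) = \bar{K}_1 \phi_\ell(\mathcal{H}; m, P) + \bar{K}_2 \mathrm{D}_\ell(\mathcal{H}; P) \sqrt{\frac{s}{m}} + \frac{\bar{K}_3 \bar{\ell} s}{m},$$

$$\tilde{U}_\ell(\mathcal{H}; P, m, s) = \tilde{K} \left(\phi_\ell(\mathcal{H}; m, P) + \mathrm{D}_\ell(\mathcal{H}; P) \sqrt{\frac{s}{m}} + \frac{\bar{\ell} s}{m}\right),$$

where $\bar{K}_1$, $\bar{K}_2$, $\bar{K}_3$, and $\tilde{K}$ are appropriately chosen constants.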

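We will be interested in having access to these quantities in the context of our algorithms;
however, since $\mathcal{P}_{XY}$ is not directly accessible to the algorithm, we will need to approximate
these by data-dependent estimators. Toward this end, we define the following quantities, again
taken from the work of Koltchinskii [2006]. For $\varepsilon > 0$, let
$\mathbb{Z}_\varepsilon = \{j \in \mathbb{Z} : 2^j \ge \varepsilon\}$. For any
$\mathcal{H} \subseteq [\mathcal{F}]$, $q \in \mathbb{N}$, and
$S = \{(x_1, y_1), \ldots, (x_q, y_q)\} \in (\mathcal{X} \times \{-1, +1\})^q$, let
$\mathcal{H}(\varepsilon; \ell, S) = \{h \in \mathcal{H} : \mathrm{R}_\ell(h; S) - \inf_{g \in \mathcal{H}} \mathrm{R}_\ell(g; S) \le \varepsilon\}$;
then for any sequence $\Xi = \{\xi_k\}_{k=1}^q \in \{-1, +1\}^q$, and any $s \in [1, \infty)$, define

\begin{align*}
\hat{\phi}_\ell(\mathcal{H}; S, \Xi) &= \sup_{h,g \in \mathcal{H}} \frac{1}{q} \sum_{k=1}^{q} \xi_k \cdot \left( \ell(h(x_k) y_k) - \ell(g(x_k) y_k) \right), \\
\hat{D}_\ell(\mathcal{H}; S)^2 &= \sup_{h,g \in \mathcal{H}} \frac{1}{q} \sum_{k=1}^{q} \left( \ell(h(x_k) y_k) - \ell(g(x_k) y_k) \right)^2, \\
\hat{U}_\ell(\mathcal{H}; S, \Xi, s) &= 12 \hat{\phi}_\ell(\mathcal{H}; S, \Xi) + 34 \hat{D}_\ell(\mathcal{H}; S) \sqrt{\frac{s}{q}} + \frac{752 \bar{\ell} s}{q}.
\end{align*}

For completeness, define $\hat{\phi}_\ell(\mathcal{H}; \emptyset, \emptyset) = \hat{D}_\ell(\mathcal{H}; \emptyset) = 0$,
and $\hat{U}_\ell(\mathcal{H}; \emptyset, \emptyset, s) = 752 \bar{\ell} s$.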
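
For a finite class, the data-dependent quantities above reduce to maxima over rows of a loss matrix. The sketch below is illustrative only: it assumes the caller has already evaluated `losses[h][k]`, the surrogate loss of hypothesis `h` on example `k`, and it takes the three numerical constants of the combined bound as explicit arguments rather than fixing them. The function names are hypothetical.

```python
import itertools
import math

def empirical_quantities(losses, signs):
    """Return (phi_hat, d_hat) for a finite class.

    losses[h][k]: surrogate loss of hypothesis h on example k.
    signs[k]: Rademacher sign in {-1, +1} attached to example k.
    """
    q = len(signs)
    if q == 0:
        return 0.0, 0.0
    # The sup over pairs (h, g) of a difference equals max minus min.
    signed = [sum(xi * v for xi, v in zip(signs, row)) / q for row in losses]
    phi_hat = max(signed) - min(signed)
    # Largest empirical mean squared difference between two loss vectors.
    d_hat_sq = max(
        sum((a - b) ** 2 for a, b in zip(row_h, row_g)) / q
        for row_h, row_g in itertools.product(losses, repeat=2)
    )
    return phi_hat, math.sqrt(d_hat_sq)

def u_hat(losses, signs, s, loss_bound, c_phi, c_d, c_loss):
    """Combine the two estimates; c_phi, c_d, c_loss are caller-supplied constants."""
    q = len(signs)
    if q == 0:
        return c_loss * loss_bound * s  # convention for the empty sample
    phi_hat, d_hat = empirical_quantities(losses, signs)
    return c_phi * phi_hat + c_d * d_hat * math.sqrt(s / q) + c_loss * loss_bound * s / q

example_losses = [[0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
example_signs = [1, -1, 1, -1]
phi_value, d_value = empirical_quantities(example_losses, example_signs)
```

For infinite classes the two suprema become optimization problems over the class, which is where the computational difficulty discussed later in the chapter arises.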

The above quantities (with appropriate choices of $\bar{K}_1$, $\bar{K}_2$, $\bar{K}_3$, and $\tilde{K}$)
can be formally related to each other and to the excess $\ell$-risk of functions in $\mathcal{H}$ via the
following general result; this variant is due to Koltchinskii [2006].

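Lemma 12.4. For any $\mathcal{H} \subseteq [\mathcal{F}]$, $s \in [1, \infty)$, distribution $P$ over
$\mathcal{X} \times \mathcal{Y}$, and any $m \in \mathbb{N}$, if $Q \sim P^m$ and
$\Xi = \{\xi_1, \ldots, \xi_m\} \sim \mathrm{Uniform}(\{-1, +1\})^m$ are independent, and
$h^* \in \mathcal{H}$ has $\mathrm{R}_\ell(h^*; P) = \inf_{h \in \mathcal{H}} \mathrm{R}_\ell(h; P)$, then with
probability at least $1 - 6e^{-s}$, the following claims hold.

\begin{align*}
&\forall h \in \mathcal{H},\ \mathrm{R}_\ell(h; P) - \mathrm{R}_\ell(h^*; P) \le \mathrm{R}_\ell(h; Q) - \mathrm{R}_\ell(h^*; Q) + \bar{U}_\ell(\mathcal{H}; P, m, s), \\
&\forall h \in \mathcal{H},\ \mathrm{R}_\ell(h; Q) - \inf_{g \in \mathcal{H}} \mathrm{R}_\ell(g; Q) \le \mathrm{R}_\ell(h; P) - \mathrm{R}_\ell(h^*; P) + \bar{U}_\ell(\mathcal{H}; P, m, s), \\
&\bar{U}_\ell(\mathcal{H}; P, m, s) < \hat{U}_\ell(\mathcal{H}; Q, \Xi, s) < \tilde{U}_\ell(\mathcal{H}; P, m, s).
\end{align*}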

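We typically expect the $\bar{U}$, $\hat{U}$, and $\tilde{U}$ quantities to be roughly within constant
factors of each other. Following Koltchinskii [2006] and Giné and Koltchinskii [2006], we can use this
result to derive localized bounds on the number of samples sufficient for
$\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ to achieve a given excess $\ell$-risk. Specifically, for
$\mathcal{H} \subseteq [\mathcal{F}]$, distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, values
$\gamma, \gamma_1, \gamma_2 \ge 0$, $s \in [1, \infty)$, and any function
$\mathfrak{s} : (0, \infty)^2 \to [1, \infty)$, define the following quantities.

\begin{align*}
\bar{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) &= \min\left\{ m \in \mathbb{N} : \bar{U}_\ell(\mathcal{H}(\gamma_2; \ell, P); P, m, s) \le \gamma_1 \right\}, \\
\bar{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) &= \sup_{\gamma' \ge \gamma} \bar{M}_\ell(\gamma'/2, \gamma'; \mathcal{H}, P, \mathfrak{s}(\gamma, \gamma')), \\
\tilde{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) &= \min\left\{ m \in \mathbb{N} : \tilde{U}_\ell(\mathcal{H}(\gamma_2; \ell, P); P, m, s) \le \gamma_1 \right\}, \\
\tilde{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) &= \sup_{\gamma' \ge \gamma} \tilde{M}_\ell(\gamma'/2, \gamma'; \mathcal{H}, P, \mathfrak{s}(\gamma, \gamma')).
\end{align*}

These quantities are well-defined for $\gamma_1, \gamma_2, \gamma > 0$ when
$\lim_{m \to \infty} \phi_\ell(\mathcal{H}; m, P) = 0$. In other cases, for completeness, we define them to
be $\infty$.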

In particular, the quantity $\bar{M}_\ell(\gamma; \mathcal{F}, \mathcal{P}_{XY}, \mathfrak{s})$ is used in
Theorem 12.6 below to quantify the performance of $\mathrm{ERM}_\ell(\mathcal{F}, \mathcal{Z}_m)$. The
primary practical challenge in calculating $\bar{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s})$ is handling
the $\phi_\ell(\mathcal{H}(\gamma'; \ell, P); m, P)$ quantity. In the literature, the typical (only?) way
such calculations are approached is by first deriving a bound on $\phi_\ell(\mathcal{H}'; m, P)$ for every
$\mathcal{H}' \subseteq \mathcal{H}$ in terms of some natural measure of complexity for the full class
$\mathcal{H}$ (e.g., entropy numbers) and some very basic measure of complexity for $\mathcal{H}'$: most
often $D_\ell(\mathcal{H}'; P)$ and sometimes a seminorm of an envelope function for $\mathcal{H}'$. After
this, one then proceeds to bound these basic measures of complexity for the specific subsets
$\mathcal{H}(\gamma'; \ell, P)$, as a function of $\gamma'$. Composing these two results is then sufficient
to bound $\phi_\ell(\mathcal{H}(\gamma'; \ell, P); m, P)$. For instance, bounds based on an entropy integral
tend to follow this strategy. This approach effectively decomposes the problem of calculating the
complexity of $\mathcal{H}(\gamma'; \ell, P)$ into the problem of calculating the complexity of
$\mathcal{H}$ and the problem of calculating some much more basic properties of
$\mathcal{H}(\gamma'; \ell, P)$. See [Bartlett, Jordan, and McAuliffe, 2006, Giné and Koltchinskii, 2006,
Koltchinskii, 2006, van der Vaart and Wellner, 1996], or Section 12.5 below, for several explicit
examples of this technique.

Another technique often (though not always) used in conjunction with the above strategy when
deriving explicit rates of convergence is to relax $D_\ell(\mathcal{H}(\gamma'; \ell, P); P)$ to
$D_\ell(\mathcal{F}(\gamma'; \ell, P); P)$ or $D_\ell([\mathcal{H}](\gamma'; \ell, P); P)$. This relaxation
can sometimes be a source of slack; however, in many interesting cases, such as for certain losses
$\ell$ [e.g., Bartlett, Jordan, and McAuliffe, 2006], or even certain noise conditions [e.g., Mammen and
Tsybakov, 1999, Tsybakov, 2004], this relaxed quantity can still lead to nearly tight bounds.

For  our  purposes,  it  will  be  convenient  to  make  these  common  techniques  explicit  in  the 
results.  In  later  sections,  this  will  make  the  benefits  of  our  proposed  methods  more  explicit, 
while  still  allowing  us  to  state  results  in  a  form  abstract  enough  to  capture  the  variety  of  specific 
complexity  measures  most  often  used  in  conjunction  with  the  above  approach.  Toward  this  end, 
we  have  the  following  definition. 

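Definition 12.5. For every distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, let
$\mathring{\phi}_\ell(\sigma, \mathcal{H}; m, P)$ be a quantity defined for every $\sigma \in [0, \infty]$,
$\mathcal{H} \subseteq [\mathcal{F}]$, and $m \in \mathbb{N}$, such that the following conditions are
satisfied when $h^*_P \in \mathcal{H}$.

\[
\text{If } 0 \le \sigma \le \sigma',\ \mathcal{H} \subseteq \mathcal{H}' \subseteq [\mathcal{F}],\ \mathcal{U} \subseteq \mathcal{X}, \text{ and } m' \le m, \text{ then }
\mathring{\phi}_\ell(\sigma, \mathcal{H}_{\mathcal{U}, h^*_P}; m, P) \le \mathring{\phi}_\ell(\sigma', \mathcal{H}'; m', P). \tag{12.4}
\]

\[
\forall \sigma \ge D_\ell(\mathcal{H}; P),\quad \phi_\ell(\mathcal{H}; m, P) \le \mathring{\phi}_\ell(\sigma, \mathcal{H}; m, P). \tag{12.5}
\]

For instance, most bounds based on entropy integrals can be made to satisfy this. See Section 12.5.3
for explicit examples of quantities $\mathring{\phi}_\ell$ from the literature that satisfy this
definition. Given a function $\mathring{\phi}_\ell$ of this type, we define the following quantity for
$m \in \mathbb{N}$, $s \in [1, \infty)$, $\zeta \in [0, \infty]$, $\mathcal{H} \subseteq [\mathcal{F}]$, and a
distribution $P$ over $\mathcal{X} \times \mathcal{Y}$.

\[
\mathring{U}_\ell(\mathcal{H}, \zeta; P, m, s)
= \tilde{K} \left( \mathring{\phi}_\ell\left( D_\ell([\mathcal{H}](\zeta; \ell, P); P), \mathcal{H}; m, P \right) + D_\ell([\mathcal{H}](\zeta; \ell, P); P) \sqrt{\frac{s}{m}} + \frac{\bar{\ell} s}{m} \right).
\]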

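Note that when $h^*_P \in \mathcal{H}$, since
$D_\ell([\mathcal{H}](\gamma; \ell, P); P) \ge D_\ell(\mathcal{H}(\gamma; \ell, P); P)$, Definition 12.5
implies
$\phi_\ell(\mathcal{H}(\gamma; \ell, P); m, P) \le \mathring{\phi}_\ell(D_\ell([\mathcal{H}](\gamma; \ell, P); P), \mathcal{H}(\gamma; \ell, P); m, P)$,
and furthermore $\mathcal{H}(\gamma; \ell, P) \subseteq \mathcal{H}$ so that
$\mathring{\phi}_\ell(D_\ell([\mathcal{H}](\gamma; \ell, P); P), \mathcal{H}(\gamma; \ell, P); m, P) \le \mathring{\phi}_\ell(D_\ell([\mathcal{H}](\gamma; \ell, P); P), \mathcal{H}; m, P)$.
Thus,

\[
\tilde{U}_\ell(\mathcal{H}(\gamma; \ell, P); P, m, s) \le \mathring{U}_\ell(\mathcal{H}(\gamma; \ell, P), \gamma; P, m, s) \le \mathring{U}_\ell(\mathcal{H}, \gamma; P, m, s). \tag{12.6}
\]

Furthermore, when $h^*_P \in \mathcal{H}$, for any measurable
$\mathcal{U} \subseteq \mathcal{U}' \subseteq \mathcal{X}$, any $\gamma' \ge \gamma \ge 0$, and any
$\mathcal{H}' \subseteq [\mathcal{F}]$ with $\mathcal{H} \subseteq \mathcal{H}'$,

\[
\mathring{U}_\ell(\mathcal{H}_{\mathcal{U}, h^*_P}, \gamma; P, m, s) \le \mathring{U}_\ell(\mathcal{H}'_{\mathcal{U}', h^*_P}, \gamma'; P, m, s). \tag{12.7}
\]

Note that the fact that we use $D_\ell([\mathcal{H}](\gamma; \ell, P); P)$ instead of
$D_\ell(\mathcal{H}(\gamma; \ell, P); P)$ in the definition of $\mathring{U}_\ell$ is crucial for these
inequalities to hold; specifically, it is not necessarily true that
$D_\ell(\mathcal{H}_{\mathcal{U}, h^*_P}(\gamma; \ell, P); P) \le D_\ell(\mathcal{H}_{\mathcal{U}', h^*_P}(\gamma; \ell, P); P)$,
but it is always the case that
$[\mathcal{H}_{\mathcal{U}, h^*_P}](\gamma; \ell, P) \subseteq [\mathcal{H}_{\mathcal{U}', h^*_P}](\gamma; \ell, P)$
when $h^*_P \in [\mathcal{H}]$, so that
$D_\ell([\mathcal{H}_{\mathcal{U}, h^*_P}](\gamma; \ell, P); P) \le D_\ell([\mathcal{H}_{\mathcal{U}', h^*_P}](\gamma; \ell, P); P)$.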

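Finally, for $\mathcal{H} \subseteq [\mathcal{F}]$, distribution $P$ over $\mathcal{X} \times \mathcal{Y}$,
values $\gamma, \gamma_1, \gamma_2 \ge 0$, $s \in [1, \infty)$, and any function
$\mathfrak{s} : (0, \infty)^2 \to [1, \infty)$, define

\begin{align*}
\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) &= \min\left\{ m \in \mathbb{N} : \mathring{U}_\ell(\mathcal{H}, \gamma_2; P, m, s) \le \gamma_1 \right\}, \\
\mathring{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) &= \sup_{\gamma' \ge \gamma} \mathring{M}_\ell(\gamma'/2, \gamma'; \mathcal{H}, P, \mathfrak{s}(\gamma, \gamma')).
\end{align*}

For completeness, define $\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) = \infty$ when
$\mathring{U}_\ell(\mathcal{H}, \gamma_2; P, m, s) > \gamma_1$ for every $m \in \mathbb{N}$.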

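It will often be convenient to isolate the terms in $\mathring{U}_\ell$ when inverting for a sufficient
$m$, thus arriving at an upper bound on $\mathring{M}_\ell$. Specifically, define

\begin{align*}
\dot{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) &= \min\left\{ m \in \mathbb{N} : D_\ell([\mathcal{H}](\gamma_2; \ell, P); P) \sqrt{\frac{s}{m}} + \frac{\bar{\ell} s}{m} \le \gamma_1 \right\}, \\
\ddot{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P) &= \min\left\{ m \in \mathbb{N} : \mathring{\phi}_\ell\left( D_\ell([\mathcal{H}](\gamma_2; \ell, P); P), \mathcal{H}; m, P \right) \le \gamma_1 \right\}.
\end{align*}

This way, for $\tilde{c} = 1/(2\tilde{K})$, we have

\[
\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) \le \max\left\{ \ddot{M}_\ell(\tilde{c}\gamma_1, \gamma_2; \mathcal{H}, P),\ \dot{M}_\ell(\tilde{c}\gamma_1, \gamma_2; \mathcal{H}, P, s) \right\}. \tag{12.8}
\]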
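
The term of the bound that involves only the diameter and the loss range can be inverted in closed form, since the condition $D\sqrt{s/m} + \bar{\ell}s/m \le \gamma_1$ is a quadratic inequality in $x = \sqrt{s/m}$. The sketch below is an illustration under assumptions: `d` stands for the diameter term, `loss_bound` for $\bar{\ell}$ (assumed strictly positive), and `gamma1 > 0`; the function name is hypothetical.

```python
import math

def smallest_m(gamma1, d, s, loss_bound):
    """Smallest positive integer m with d*sqrt(s/m) + loss_bound*s/m <= gamma1."""
    def ok(m):
        return d * math.sqrt(s / m) + loss_bound * s / m <= gamma1

    # Positive root of loss_bound * x^2 + d * x = gamma1, where x = sqrt(s / m).
    x = (-d + math.sqrt(d * d + 4.0 * loss_bound * gamma1)) / (2.0 * loss_bound)
    m = max(1, math.ceil(s / (x * x)))
    # Guard against floating-point rounding near the boundary.
    while m > 1 and ok(m - 1):
        m -= 1
    while not ok(m):
        m += 1
    return m
```

Because the left-hand side is decreasing in $m$, the final adjustment loops only move the candidate by a step or two in typical cases.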
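We will express our main abstract results below in terms of the incremental values
$\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, \mathcal{P}_{XY}, s)$; the quantity
$\mathring{M}_\ell(\gamma; \mathcal{H}, \mathcal{P}_{XY}, \mathfrak{s})$ will also be useful in deriving
analogous results for $\mathrm{ERM}_\ell$. When $h^*_P \in \mathcal{H}$, (12.6) implies

\[
\bar{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) \le \tilde{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) \le \mathring{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}). \tag{12.10}
\]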

12.3  Methods  Based  on  Optimizing  the  Surrogate  Risk 

Perhaps the simplest way to make use of a surrogate loss function is to try to optimize
$\mathrm{R}_\ell(h)$ over $h \in \mathcal{F}$, until identifying $h \in \mathcal{F}$ with
$\mathrm{R}_\ell(h) - \mathrm{R}_\ell(h^*) < \Gamma_\ell(\varepsilon)$, at which point we are guaranteed
$\mathrm{er}(h) - \mathrm{er}(h^*) \le \varepsilon$. In this section, we briefly discuss some known results
for this basic idea, along with a comment on the potential drawbacks of this approach for active learning.

12.3.1  Passive  Learning:  Empirical  Risk  Minimization 

In the context of passive learning, the method of empirical $\ell$-risk minimization is one of the
most-studied methods for optimizing $\mathrm{R}_\ell(h)$ over $h \in \mathcal{F}$. Based on Lemma 12.4 and
the above definitions, one can derive a bound on the number of labeled data points $m$ sufficient for
$\mathrm{ERM}_\ell(\mathcal{F}, \mathcal{Z}_m)$ to achieve a given excess error rate. Specifically, the
following theorem is due to Koltchinskii [2006] (slightly modified here, following Giné and Koltchinskii
[2006], to allow for general $\mathfrak{s}$ functions). It will serve as our baseline for comparison in
the applications below.

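Theorem 12.6. Fix any function $\mathfrak{s} : (0, \infty)^2 \to [1, \infty)$. If $h^* \in \mathcal{F}$,
then for any $m \ge \bar{M}_\ell(\Gamma_\ell(\varepsilon); \mathcal{F}, \mathcal{P}_{XY}, \mathfrak{s})$, with
probability at least
$1 - \sum_{j \in \mathbb{Z}_{\Gamma_\ell(\varepsilon)}} 6 e^{-\mathfrak{s}(\Gamma_\ell(\varepsilon), 2^j)}$,
$\mathrm{ERM}_\ell(\mathcal{F}, \mathcal{Z}_m)$ produces a function $\hat{h}$ such that
$\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.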

12.3.2  Negative  Results  for  Active  Learning 

As mentioned, there are several active learning methods designed to optimize a general loss
function [Beygelzimer, Dasgupta, and Langford, 2009, Koltchinskii, 2010]. However, it turns
out that for many interesting loss functions, the number of labels required for active learning to
achieve a given excess surrogate risk value is not significantly smaller than that sufficient for
passive learning by $\mathrm{ERM}_\ell$.

Specifically, consider a problem with $\mathcal{X} = \{x_0, x_1\}$, let $z \in (0, 1/2)$ be a constant, and
for $\varepsilon \in (0, z)$, let $\mathcal{P}(\{x_1\}) = \varepsilon/(2z)$,
$\mathcal{P}(\{x_0\}) = 1 - \mathcal{P}(\{x_1\})$, and suppose $\mathcal{F}$ and $\ell$ are such that for
$\eta(x_1) = 1/2 + z$ and any $\eta(x_0) \in [4/6, 5/6]$, we have $h^* \in \mathcal{F}$. For this problem,
any function $h$ with $\mathrm{sign}(h(x_1)) \ne +1$ has
$\mathrm{er}(h) - \mathrm{er}(h^*) \ge \varepsilon$, so that
$\Gamma_\ell(\varepsilon) \le (\varepsilon/(2z)) \left( \ell^*_{-}(\eta(x_1)) - \ell^*(\eta(x_1)) \right)$;
when $\ell$ is classification-calibrated and $\bar{\ell} < \infty$, this is $c\varepsilon$, for some
$\ell$-dependent $c \in (0, \infty)$. Any function $h$ with
$\mathrm{R}_\ell(h) - \mathrm{R}_\ell(h^*) \le c\varepsilon$ for this problem must have
$\mathrm{R}_\ell(h; \mathcal{P}_{\{x_0\}}) - \mathrm{R}_\ell(h^*; \mathcal{P}_{\{x_0\}}) \le c\varepsilon / \mathcal{P}(\{x_0\}) = O(\varepsilon)$.
Existing results of Hanneke and Yang [2010] (with a slight modification to rescale for
$\eta(x_0) \in [4/6, 5/6]$) imply that, for many classification-calibrated losses $\ell$, the minimax
optimal number of labels sufficient for an active learning algorithm to achieve this is
$\Theta(1/\varepsilon)$. Hanneke and Yang [2010] specifically show this for losses $\ell$ that are strictly
positive, decreasing, strictly convex, and twice differentiable with continuous second derivative;
however, that result can easily be extended to a wide variety of other classification-calibrated losses,
such as the quadratic loss, which satisfy these conditions in a neighborhood of 0. It is also known
[Bartlett, Jordan, and McAuliffe, 2006] (see also below) that for many such losses (specifically,
those satisfying Condition 12.3 with $r_\ell = 2$), $\Theta(1/\varepsilon)$ random labeled samples are
sufficient for $\mathrm{ERM}_\ell$ to achieve this same guarantee, so that results that only bound the
surrogate risk of the function produced by an active learning method in this scenario can be at most a
constant factor smaller than those provable for passive learning methods.

In the next section, we provide an active learning algorithm and a general analysis of its
performance which, in the special case described above, guarantees excess error rate less than
$\varepsilon$ with high probability, using a number of label requests
$O(\log(1/\varepsilon) \log\log(1/\varepsilon))$. The implication is that, to identify the improvements
achievable by active learning with a surrogate loss, it is not sufficient to merely analyze the surrogate
risk of the function produced by a given active learning algorithm. Indeed, since we are not particularly
interested in the surrogate risk itself, we may even consider active learning algorithms that do not
actually optimize $\mathrm{R}_\ell(h)$ over $h \in \mathcal{F}$ (even in the limit).


12.4  Alternative  Use  of  the  Surrogate  Loss 

Given that we are interested in $\ell$ only insofar as it helps us to optimize the error rate with
computational efficiency, we should ask whether there is a method that sometimes makes more effective
use of $\ell$ in terms of optimizing the error rate, while maintaining essentially the same
computational advantages. The following method is essentially a relaxation of the methods of
Koltchinskii [2010] and Hanneke [2012]. Similar results should also hold for analogous relaxations of the
related methods of Balcan, Beygelzimer, and Langford [2006], Dasgupta, Hsu, and Monteleoni
[2007a], Balcan, Beygelzimer, and Langford [2009], and Beygelzimer, Dasgupta, and Langford
[2009].

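Algorithm 1:

Input: surrogate loss $\ell$, unlabeled sample budget $u$, labeled sample budget $n$

Output: classifier $\hat{h}$

0. $V \leftarrow \mathcal{F}$, $Q \leftarrow \{\}$, $m \leftarrow 1$, $t \leftarrow 0$
1. While $m < u$ and $t < n$
2.     $m \leftarrow m + 1$
3.     If $X_m \in \mathrm{DIS}(V)$
4.         Request label $Y_m$ and let $Q \leftarrow Q \cup \{(m, Y_m)\}$, $t \leftarrow t + 1$
5.     If $\log_2(m) \in \mathbb{N}$
6.         $V \leftarrow \left\{ h \in V : \mathrm{R}_\ell(h; Q) - \inf_{g \in V} \mathrm{R}_\ell(g; Q) \le \hat{T}_\ell(V; Q, m) \right\}$
7.         $Q \leftarrow \{\}$
8. Return $\hat{h} = \mathrm{argmin}_{h \in V} \mathrm{R}_\ell(h; Q)$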
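
The steps of Algorithm 1 can be sketched for a finite class as follows. This is a minimal illustration, not the implementation analyzed in this chapter: the class is a finite list of callables, `threshold` is a caller-supplied stand-in for the threshold used in Step 6, and all function and parameter names are hypothetical. The usage at the end takes a zero threshold, which is only sensible for the noise-free toy data used there.

```python
def sign(v):
    return 1 if v >= 0 else -1

def in_disagreement(V, x):
    # x is in DIS(V) when two functions in V predict different signs at x.
    return len({sign(h(x)) for h in V}) > 1

def empirical_risk(h, Q, xs, loss):
    # Average surrogate loss over the labeled pairs (index, label) in Q.
    if not Q:
        return 0.0
    return sum(loss(h(xs[i - 1]) * y) for i, y in Q) / len(Q)

def algorithm1(F, xs, label_oracle, loss, threshold, u, n):
    """xs[m - 1] plays the role of X_m; label_oracle(m) returns Y_m."""
    V, Q, m, t = list(F), [], 1, 0                      # Step 0
    while m < u and t < n:                              # Step 1
        m += 1                                          # Step 2
        if in_disagreement(V, xs[m - 1]):               # Step 3
            Q.append((m, label_oracle(m)))              # Step 4
            t += 1
        if m & (m - 1) == 0:                            # Step 5: m is a power of 2
            risks = [empirical_risk(h, Q, xs, loss) for h in V]
            best = min(risks)
            cutoff = threshold(V, Q, m)
            V = [h for h, r in zip(V, risks) if r - best <= cutoff]  # Step 6
            Q = []                                      # Step 7
    return min(V, key=lambda h: empirical_risk(h, Q, xs, loss))      # Step 8

def make_threshold_classifier(t0):
    return lambda x: 1.0 if x >= t0 else -1.0

# Toy usage: sign-valued threshold classifiers, noise-free labels, hinge loss.
toy_class = [make_threshold_classifier(t0) for t0 in (0.1, 0.3, 0.5, 0.7, 0.9)]
toy_xs = [(i * 0.618033988749895) % 1.0 for i in range(1, 65)]
toy_oracle = lambda m: 1 if toy_xs[m - 1] >= 0.5 else -1
hinge = lambda zval: max(0.0, 1.0 - zval)
h_hat = algorithm1(toy_class, toy_xs, toy_oracle, hinge,
                   lambda V, Q, m: 0.0, u=64, n=64)
```

In this toy run, labels are requested only while the surviving thresholds still disagree on the incoming point, so label requests stop once a single classifier remains.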
The intuition behind this algorithm is that, since we are only interested in achieving low
error rate, once we have identified $\mathrm{sign}(h^*(x))$ for a given $x \in \mathcal{X}$, there is no
need to further optimize the value $\mathbb{E}[\ell(h(X)Y) \mid X = x]$. Thus, as long as we maintain
$h^* \in V$, the data points $X_m \notin \mathrm{DIS}(V)$ are typically less informative than those
$X_m \in \mathrm{DIS}(V)$. We therefore focus the label requests on those $X_m \in \mathrm{DIS}(V)$, since
there remains some uncertainty about $\mathrm{sign}(h^*(X_m))$ for these points. The algorithm updates $V$
periodically (Step 6), removing those functions $h$ whose excess empirical risks (under the current
sampling distribution) are relatively large; by setting this threshold $\hat{T}_\ell$ appropriately, we
can guarantee the excess empirical risk of $h^*$ is smaller than $\hat{T}_\ell$. Thus, the algorithm
maintains $h^* \in V$ as an invariant, while focusing the sampling region $\mathrm{DIS}(V)$.

In practice, the set $V$ can be maintained implicitly, simply by keeping track of the constraints
(Step 6) that define it; then the condition in Step 3 can be checked by solving two constraint
satisfaction problems (one for each sign); likewise, the value $\inf_{g \in V} \mathrm{R}_\ell(g; Q)$ in
these constraints, as well as the final $\hat{h}$, can be found by solving constrained optimization
problems. Thus, for convex loss functions and convex classes of functions, these steps typically have
computationally efficient realizations, as long as the $\hat{T}_\ell$ values can also be obtained
efficiently. The quantity $\hat{T}_\ell$ in Algorithm 1 can be defined in one of several possible ways. In
our present abstract context, we consider the following definition. Let $\{\xi'_k\}_{k \in \mathbb{N}}$
denote independent Rademacher random variables (i.e., uniform in $\{-1, +1\}$), also independent from
$\mathcal{Z}$; these should be considered internal random bits used by the algorithm, which is therefore a
randomized algorithm. For any $q \in \mathbb{N} \cup \{0\}$ and
$Q = \{(i_1, y_1), \ldots, (i_q, y_q)\} \in (\mathbb{N} \times \{-1, +1\})^q$, let
$S[Q] = \{(X_{i_1}, y_1), \ldots, (X_{i_q}, y_q)\}$ and $\Xi[Q] = \{\xi'_{i_k}\}_{k=1}^{q}$. For
$s \in [1, \infty)$, define

\[
\hat{U}_\ell(\mathcal{H}; Q, s) = \hat{U}_\ell(\mathcal{H}; S[Q], \Xi[Q], s).
\]

Then we can define the quantity $\hat{T}_\ell$ in the method above as

\[
\hat{T}_\ell(\mathcal{H}; Q, m) = \hat{U}_\ell(\mathcal{H}; Q, \hat{s}(m)), \tag{12.11}
\]
for some $\hat{s} : \mathbb{N} \to [1, \infty)$. This definition has the appealing property that it allows
us to interpret the update in Step 6 in two complementary ways: as comparing the empirical risks of
functions in $V$ under the conditional distribution given the region of disagreement
$\mathcal{P}_{\mathrm{DIS}(V)}$, and as comparing the empirical risks of the functions in
$V_{\mathrm{DIS}(V)}$ under the original distribution $\mathcal{P}_{XY}$. Our abstract results below are
based on this definition of $\hat{T}_\ell$. This can sometimes be problematic due to the computational
challenge of the optimization problem in the definitions of $\hat{\phi}_\ell$ and $\hat{D}_\ell$. There has
been considerable work on calculating and bounding $\hat{\phi}_\ell$ for various classes $\mathcal{F}$ and
losses $\ell$ [e.g., Bartlett and Mendelson, 2002, Koltchinskii, 2001], but it is not always feasible.
However, the specific applications below continue to hold if we instead take $\hat{T}_\ell$ based on a
well-chosen upper bound on the respective $\hat{U}_\ell$ function, such as those obtained in the
derivations of those respective results below; we provide descriptions of such efficiently-computable
relaxations for each of the applications below (though in some cases, these bounds have a mild
dependence on $\mathcal{P}_{XY}$ via certain parameters of the specific noise conditions considered there).

We  have  the  following  theorem,  which  represents  our  main  abstract  result.  The  proof  is 
included  in  Appendix  12.6. 

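Theorem 12.7. Fix any function $\hat{s} : \mathbb{N} \to [1, \infty)$. Let
$j_\ell = -\lceil \log_2(\bar{\ell}) \rceil$, define $u_{j_\ell - 2} = u_{j_\ell - 1} = 1$, and for each
integer $j \ge j_\ell$, let
$\mathcal{F}_j = \mathcal{F}(\mathcal{E}_\ell(2^{2-j}); 01)_{\mathrm{DIS}(\mathcal{F}(\mathcal{E}_\ell(2^{2-j}); 01))}$,
$\mathcal{U}_j = \mathrm{DIS}(\mathcal{F}_j)$, and suppose $u_j \in \mathbb{N}$ satisfies
$\log_2(u_j) \in \mathbb{N}$ and

\[
u_j \ge 2 \mathring{M}_\ell\left( 2^{-j-1}, 2^{2-j}; \mathcal{F}_j, \mathcal{P}_{XY}, \hat{s}(u_j) \right) \lor u_{j-1} \lor 2u_{j-2}. \tag{12.12}
\]

Suppose $h^* \in \mathcal{F}$. For any $\varepsilon \in (0, 1)$ and $s \in [1, \infty)$, letting
$j_\varepsilon = \lceil \log_2(1/\Gamma_\ell(\varepsilon)) \rceil$, if

\[
u \ge u_{j_\varepsilon} \quad \text{and} \quad n \ge s + 2e \sum_{j=j_\ell}^{j_\varepsilon} \mathcal{P}(\mathcal{U}_j) u_j,
\]

then, with arguments $\ell$, $u$, and $n$, Algorithm 1 uses at most $u$ unlabeled samples and makes at
most $n$ label requests, and with probability at least

\[
1 - 2^{-s} - \sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6 e^{-\hat{s}(2^i)},
\]

returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.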

The number of label requests indicated by Theorem 12.7 can often (though not always) be
significantly smaller than the number of random labeled data points sufficient for $\mathrm{ERM}_\ell$ to
achieve the same, as indicated by Theorem 12.6. This is typically the case when
$\mathcal{P}(\mathcal{U}_j) \to 0$ as $j \to \infty$. When this is the case, the number of labels requested
by the algorithm is sublinear in the number of unlabeled samples it processes; below, we will derive more
explicit results for certain types of function classes $\mathcal{F}$, by characterizing the rate at which
$\mathcal{P}(\mathcal{U}_j)$ vanishes in terms of a complexity measure known as the disagreement
coefficient.

For the purpose of calculating the values $\mathring{M}_\ell$ in Theorem 12.7, it is sometimes convenient
to use the alternative interpretation of Algorithm 1, in terms of sampling $Q$ from the conditional
distribution $\mathcal{P}_{\mathrm{DIS}(V)}$. Specifically, the following lemma allows us to replace
calculations in terms of $\mathcal{F}_j$ and $\mathcal{P}_{XY}$ with calculations in terms of
$\mathcal{F}(\mathcal{E}_\ell(2^{1-j}); 01)$ and $\mathcal{P}_{\mathrm{DIS}(\mathcal{F}_j)}$. Its proof is
included in Appendix 12.6.

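Lemma 12.8. Let $\mathring{\phi}_\ell$ be any function satisfying Definition 12.5. Let $P$ be any
distribution over $\mathcal{X} \times \mathcal{Y}$. For any measurable
$\mathcal{U} \subseteq \mathcal{X} \times \mathcal{Y}$ with $P(\mathcal{U}) > 0$, define
$P_{\mathcal{U}}(\cdot) = P(\cdot \mid \mathcal{U})$. Also, for any $\sigma \ge 0$,
$\mathcal{H} \subseteq [\mathcal{F}]$, and $m \in \mathbb{N}$, if $P(\mathrm{DISF}(\mathcal{H})) > 0$, define

\[
\mathring{\phi}'_\ell(\sigma, \mathcal{H}; m, P) = 32 \left( \inf_{\substack{\mathcal{U} = \mathcal{U}' \times \mathcal{Y} : \\ \mathcal{U}' \supseteq \mathrm{DISF}(\mathcal{H})}} P(\mathcal{U})\, \mathring{\phi}_\ell\left( \frac{\sigma}{\sqrt{P(\mathcal{U})}}, \mathcal{H}; \lceil (1/2) P(\mathcal{U}) m \rceil, P_{\mathcal{U}} \right) + \frac{\bar{\ell}}{m} + \sigma \sqrt{\frac{1}{m}} \right), \tag{12.13}
\]

and otherwise define $\mathring{\phi}'_\ell(\sigma, \mathcal{H}; m, P) = 0$. Then the function
$\mathring{\phi}'_\ell$ also satisfies Definition 12.5.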

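Plugging this $\mathring{\phi}'_\ell$ function into Theorem 12.7 immediately yields the following
corollary, the proof of which is included in Appendix 12.6.

Corollary 12.9. Fix any function $\hat{s} : \mathbb{N} \to [1, \infty)$. Let
$j_\ell = -\lceil \log_2(\bar{\ell}) \rceil$, define $u_{j_\ell - 2} = u_{j_\ell - 1} = 1$, and for each
integer $j \ge j_\ell$, let $\mathcal{F}_j$ and $\mathcal{U}_j$ be as in Theorem 12.7, and if
$\mathcal{P}(\mathcal{U}_j) > 0$, suppose $u_j \in \mathbb{N}$ satisfies $\log_2(u_j) \in \mathbb{N}$ and

\[
u_j \ge 4 \mathcal{P}(\mathcal{U}_j)^{-1} \mathring{M}_\ell\left( \frac{2^{-j-7}}{\mathcal{P}(\mathcal{U}_j)}, \frac{2^{2-j}}{\mathcal{P}(\mathcal{U}_j)}; \mathcal{F}_j, \mathcal{P}_{\mathcal{U}_j}, \hat{s}(u_j) \right) \lor u_{j-1} \lor 2u_{j-2}. \tag{12.14}
\]

If $\mathcal{P}(\mathcal{U}_j) = 0$, let $u_j \in \mathbb{N}$ satisfy $\log_2(u_j) \in \mathbb{N}$ and
$u_j \ge \tilde{K} \bar{\ell} \hat{s}(u_j) 2^{j+2} \lor u_{j-1} \lor 2u_{j-2}$. Suppose
$h^* \in \mathcal{F}$. For any $\varepsilon \in (0, 1)$ and $s \in [1, \infty)$, letting
$j_\varepsilon = \lceil \log_2(1/\Gamma_\ell(\varepsilon)) \rceil$, if

\[
u \ge u_{j_\varepsilon} \quad \text{and} \quad n \ge s + 2e \sum_{j=j_\ell}^{j_\varepsilon} \mathcal{P}(\mathcal{U}_j) u_j,
\]

then, with arguments $\ell$, $u$, and $n$, Algorithm 1 uses at most $u$ unlabeled samples and makes at
most $n$ label requests, and with probability at least

\[
1 - 2^{-s} - \sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6 e^{-\hat{s}(2^i)},
\]

returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.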

Algorithm 1 can be modified in a variety of interesting ways, leading to related methods that
can be analyzed analogously. One simple modification is to use a more involved bound to define
the quantity $\hat{T}_\ell$. For instance, for $Q$ as above, and a function
$\hat{s} : (0, \infty) \times \mathbb{Z} \times \mathbb{N} \to [1, \infty)$, one could define

\[
\hat{T}_\ell(\mathcal{H}; Q, m) = (3/2) q^{-1} \inf\left\{ \lambda > 0 : \forall k \in \mathbb{Z}_\lambda,\
\hat{U}_\ell\left( \mathcal{H}\left( 3 q^{-1} 2^{k-1}; \ell, S[Q] \right); Q, \hat{s}(\lambda, k, m) \right) \le 2^{k-4} q^{-1} \right\},
\]

for which one can also prove a result similar to Lemma 12.4 [see Giné and Koltchinskii, 2006,
Koltchinskii, 2006]. This definition shares the convenient dual-interpretations property mentioned
above about $\hat{U}_\ell(\mathcal{H}; Q, \hat{s}(m))$; furthermore, results analogous to those above for
Algorithm 1 also hold under this definition (under mild restrictions on the allowed $\hat{s}$ functions),
with only a few modifications to constants and event probabilities (e.g., summing over the
$k \in \mathbb{Z}_\lambda$ argument to $\hat{s}$ in the probability, while setting the $\lambda$ argument
to $2^{-j}$ for the largest $j$ with $u_j \le 2^i$).

The update trigger in Step 5 can also be modified in several ways, leading to interesting related
methods. One possibility is that, if we have updated the $V$ set $k - 1$ times already, and the previous
update occurred at $m = m_{k-1}$, at which point $V = V_{k-1}$, $Q = Q_{k-1}$ (before the update), then we
could choose to update $V$ a $k$th time when $\log_2(m - m_{k-1}) \in \mathbb{N}$ and
$\hat{U}_\ell(V; Q, \hat{s}(\gamma_{k-1}, m - m_{k-1})) \frac{|Q| \lor 1}{m - m_{k-1}} \le \gamma_{k-1}/2$,
for some function $\hat{s} : (0, \infty) \times \mathbb{N} \to [1, \infty)$, where $\gamma_{k-1}$ is
inductively defined as
$\gamma_{k-1} = \hat{U}_\ell(V_{k-1}; Q_{k-1}, \hat{s}(\gamma_{k-2}, m_{k-1} - m_{k-2})) \frac{|Q_{k-1}| \lor 1}{m_{k-1} - m_{k-2}}$
(and $\gamma_0 = \bar{\ell}$), and we would then use
$\hat{U}_\ell(V; Q, \hat{s}(\gamma_{k-1}, m - m_{k-1}))$ for the $\hat{T}_\ell$ value in the update; in
other words, we could update $V$ when the value of the concentration inequality used in the update has
been reduced by a factor of 2. This modification leads to results quite similar to those stated above
(under mild restrictions on the allowed $\hat{s}$ functions), with only a change to the probability
(namely, summing the exponential failure probabilities $e^{-\hat{s}(2^{-j}, 2^i)}$ over values of $j$
between $j_\ell$ and $j_\varepsilon$, and values of $i$ between 1 and $\log_2(u_j)$); additionally, with
this modification, because we check for $\log_2(m - m_{k-1}) \in \mathbb{N}$ rather than
$\log_2(m) \in \mathbb{N}$, one can remove the "$\lor u_{j-1} \lor 2u_{j-2}$" term in (12.12) and (12.14)
(though this has no effect for the applications below). Another interesting possibility in this vein is
to update when $\log_2(m - m_{k-1}) \in \mathbb{N}$ and
$\hat{U}_\ell(V; Q, \hat{s}(\Gamma_\ell(2^{-k}), m - m_{k-1})) \frac{|Q| \lor 1}{m - m_{k-1}} < \Gamma_\ell(2^{-k})$.
Of course, the value $\Gamma_\ell(2^{-k})$ is typically not directly available to us, but we could
substitute a distribution-independent lower bound on $\Gamma_\ell(2^{-k})$, for instance based on the
$\psi_\ell$ function of Bartlett, Jordan, and McAuliffe [2006]; in the active learning context, we could
potentially use unlabeled samples to estimate a $\mathcal{P}$-dependent lower bound on
$\Gamma_\ell(2^{-k})$, or even
$\mathrm{diam}(V) \psi_\ell\left( 2^{-k} / (2\,\mathrm{diam}(V)) \right)$, based on (12.3), where
$\mathrm{diam}(V) = \sup_{h,g \in V} \Delta(h, g)$.

12.5  Applications 

In this section, we apply the abstract results from above to a few commonly-studied scenarios:
namely, VC subgraph classes and entropy conditions, with some additional mention of VC major
classes and VC hull classes. In the interest of making the results more concise and explicit, we
express them in terms of well-known conditions relating distances to excess risks. We also
express them in terms of a lower bound on $\Gamma_\ell(\varepsilon)$ of the type in (12.2), with convenient
properties that allow for closed-form expression of the results. To simplify the presentation, we often
omit numerical constant factors in the inequalities below, and for this we use the common notation
$f(x) \lesssim g(x)$ to mean that $f(x) \le c g(x)$ for some implicit universal constant
$c \in (0, \infty)$.

12.5.1  Diameter  Conditions 

To  begin,  we  first  state  some  general  characterizations  relating  distances  to  excess  risks;  these 
characterizations  will  make  it  easier  to  express  our  results  more  concretely  below,  and  make 
for  a  more  straightforward  comparison  between  results  for  the  above  methods.  The  following 
condition,  introduced  by  Mammen  and  Tsybakov  [1999]  and  Tsybakov  [2004],  is  a  well-known 
noise  condition,  about  which  there  is  now  an  extensive  literature  [e.g.,  Bartlett,  Jordan,  and 
McAuliffe,  2006,  Hanneke,  2011,  2012,  Koltchinskii,  2006]. 

Condition 12.10. For some $a \in [1, \infty)$ and $\alpha \in [0, 1]$, for every $g \in \mathcal{F}^*$,

\[
\Delta(g, h^*) \le a \left( \mathrm{er}(g) - \mathrm{er}(h^*) \right)^{\alpha}.
\]

Condition 12.10 can be equivalently expressed in terms of certain noise conditions [Bartlett,
Jordan, and McAuliffe, 2006, Mammen and Tsybakov, 1999, Tsybakov, 2004]. Specifically,
satisfying Condition 12.10 with some $\alpha < 1$ is equivalent to the existence of some
$a' \in [1, \infty)$ such that, for all $\varepsilon > 0$,

\[
\mathcal{P}\left( x : |\eta(x) - 1/2| \le \varepsilon \right) \le a' \varepsilon^{\alpha/(1-\alpha)},
\]

which is often referred to as a low noise condition. Additionally, satisfying Condition 12.10 with
$\alpha = 1$ is equivalent to having some $a' \in [1, \infty)$ such that

\[
\mathcal{P}\left( x : |\eta(x) - 1/2| \le 1/a' \right) = 0,
\]

often referred to as a bounded noise condition.
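
As a brief worked illustration of one direction of the bounded noise equivalence (a standard calculation, included here under the assumption that $\mathrm{sign}(h^*(x))$ agrees with the Bayes-optimal sign $\mathrm{sign}(2\eta(x) - 1)$ almost everywhere), suppose $|\eta(x) - 1/2| > 1/a'$ with probability one. Then the excess error of any $g$ is an integral of $|2\eta(x) - 1|$ over the region where $g$ and $h^*$ disagree in sign, which gives Condition 12.10 with $\alpha = 1$:

```latex
\begin{align*}
\mathrm{er}(g) - \mathrm{er}(h^*)
  &= \int_{\{x \,:\, \mathrm{sign}(g(x)) \ne \mathrm{sign}(h^*(x))\}} |2\eta(x) - 1| \, \mathcal{P}(dx) \\
  &\ge \frac{2}{a'} \, \mathcal{P}\left( x : \mathrm{sign}(g(x)) \ne \mathrm{sign}(h^*(x)) \right)
   = \frac{2}{a'} \, \Delta(g, h^*), \\
\text{so that} \quad \Delta(g, h^*)
  &\le \max\left\{ \frac{a'}{2}, 1 \right\} \left( \mathrm{er}(g) - \mathrm{er}(h^*) \right).
\end{align*}
```

The maximum with 1 only serves to keep the constant in the range $[1, \infty)$ required by the condition.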

For simplicity, we formulate our results in terms of $a$ and $\alpha$ from Condition 12.10. However,
for the abstract results in this section, the results remain valid under the weaker condition that
replaces $\mathcal{F}^*$ by $\mathcal{F}$, and adds the condition that $h^* \in \mathcal{F}$. In fact, the
specific results in this section also remain valid using this weaker condition while additionally using
(12.3) in place of (12.2), as remarked above.

An  analogous  condition  can  be  defined  for  the  surrogate  loss  function,  as  follows.  Similar 
notions  have  been  explored  by  Bartlett,  Jordan,  and  McAuliffe  [2006]  and  Koltchinskii  [2006]. 

Condition 12.11. For some $b \in [1, \infty)$ and $\beta \in [0, 1]$, for every $g \in [\mathcal{F}]$,

\[
D_\ell(g, h^*_P; P)^2 \le b \left( \mathrm{R}_\ell(g; P) - \mathrm{R}_\ell(h^*_P; P) \right)^{\beta}.
\]


Note that these conditions are always satisfied for some values of $a$, $b$, $\alpha$, $\beta$, since
$\alpha = \beta = 0$ trivially satisfies the conditions. However, in more benign scenarios, values of
$\alpha$ and $\beta$ strictly greater than 0 can be satisfied. Furthermore, for some loss functions
$\ell$, Condition 12.11 can even be satisfied universally, in the sense that a value of $\beta > 0$ is
satisfied for all distributions. In particular, Bartlett, Jordan, and McAuliffe [2006] show that this is
the case under Condition 12.3, as stated in the following lemma [see Bartlett, Jordan, and McAuliffe,
2006, for the proof].

Lemma 12.12. Suppose Condition 12.3 is satisfied. Let $\beta = \min\{1, 2/r_\ell\}$ and
$b = \left( 2 C_\ell d_\ell^{\min\{r_\ell - 2, 0\}} \right)^{-\beta} L^2$. Then every distribution $P$ over
$\mathcal{X} \times \mathcal{Y}$ with $h^*_P \in [\mathcal{F}]$ satisfies Condition 12.11 with these values
of $b$ and $\beta$.

Under Condition 12.10, it is particularly straightforward to obtain bounds on
$\Gamma_\ell(\varepsilon)$ based on a function $\Psi_\ell(\varepsilon)$ satisfying (12.2). For instance,
since $x \mapsto x \psi_\ell(1/x)$ is nonincreasing on $(0, \infty)$ [Bartlett, Jordan, and McAuliffe,
2006], the function

\[
\Psi_\ell(\varepsilon) = a \varepsilon^{\alpha} \psi_\ell\left( \varepsilon^{1-\alpha} / (2a) \right) \tag{12.15}
\]

satisfies $\Psi_\ell(\varepsilon) \le \Gamma_\ell(\varepsilon)$ [Bartlett, Jordan, and McAuliffe, 2006].
Furthermore, for classification-calibrated $\ell$, $\Psi_\ell$ in (12.15) is strictly increasing,
nonnegative, and continuous on $(0, 1)$ [Bartlett, Jordan, and McAuliffe, 2006], and has
$\Psi_\ell(0) = 0$; thus, the inverse $\Psi_\ell^{-1}(\gamma)$, defined for all $\gamma > 0$ by

\[
\Psi_\ell^{-1}(\gamma) = \inf\left( \{ \varepsilon > 0 : \gamma \le \Psi_\ell(\varepsilon) \} \cup \{1\} \right), \tag{12.16}
\]

is strictly increasing, nonnegative, and continuous on $(0, \Psi_\ell(1))$. Furthermore, one can easily
show $x \mapsto \Psi_\ell^{-1}(x)/x$ is nonincreasing on $(0, \infty)$. Also note that
$\forall \gamma > 0$, $\mathcal{E}_\ell(\gamma) \le \Psi_\ell^{-1}(\gamma)$.
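
The inverse in (12.16) rarely needs a closed form, since a monotone function on $(0, 1]$ can be inverted numerically. The sketch below is a hedged illustration: `psi` is a caller-supplied $\psi_\ell$ transform, and the example uses $\psi(x) = x^2$, the transform reported by Bartlett, Jordan, and McAuliffe [2006] for the quadratic loss, purely as an assumed test case. The function name and tolerance are hypothetical.

```python
def psi_inverse(gamma, a, alpha, psi, tol=1e-12):
    """Numerically evaluate inf({eps in (0, 1] : gamma <= Psi(eps)} U {1}).

    Psi(eps) = a * eps**alpha * psi(eps**(1 - alpha) / (2 * a)) is assumed to
    be nondecreasing on (0, 1], so bisection locates the smallest feasible eps.
    """
    def big_psi(eps):
        return a * eps ** alpha * psi(eps ** (1.0 - alpha) / (2.0 * a))

    if big_psi(1.0) < gamma:
        return 1.0  # no eps in (0, 1] is feasible, so the value is 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if big_psi(mid) >= gamma:
            hi = mid
        else:
            lo = mid
    return hi

quadratic_psi = lambda x: x * x
```

With $\psi(x) = x^2$ the function in (12.15) simplifies to $\varepsilon^{2-\alpha}/(4a)$, so the numerical inverse can be checked against $(4a\gamma)^{1/(2-\alpha)}$ whenever that value is at most 1.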


12.5.2  The  Disagreement  Coefficient 

In order to more concisely state our results, it will be convenient to bound $\mathcal{P}(\mathrm{DIS}(\mathcal{H}))$ by a linear function of $\mathrm{radius}(\mathcal{H})$, for $\mathrm{radius}(\mathcal{H})$ in a given range. This type of relaxation has been used extensively in the active learning literature [Balcan, Hanneke, and Vaughan, 2010, Beygelzimer, Dasgupta, and Langford, 2009, Dasgupta, Hsu, and Monteleoni, 2007a, Friedman, 2009, Hanneke, 2007a, 2009, 2011, 2012, Koltchinskii, 2010, Mahalanabis, 2011, Raginsky and Rakhlin, 2011, Wang, 2011], and the coefficient in the linear function is typically referred to as the disagreement coefficient. Specifically, the following definition is due to Hanneke [2007a, 2011]; related quantities have been explored by Alexander [1987] and Giné and Koltchinskii [2006].

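Definition 12.13. For any $r_0 > 0$, define the disagreement coefficient of a function $h : \mathcal{X} \to \mathbb{R}$ with respect to $\mathcal{F}$ under $\mathcal{P}$ as

$$\theta_h(r_0) = \sup_{r > r_0} \frac{\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h, r)))}{r} \vee 1.$$

If $h^* \in \mathcal{F}$, define the disagreement coefficient of the class $\mathcal{F}$ as $\theta(r_0) = \theta_{h^*}(r_0)$.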

The value of $\theta(\varepsilon)$ has been studied and bounded for various function classes $\mathcal{F}$ under various conditions on $\mathcal{P}$. In many cases of interest, $\theta(\varepsilon)$ is known to be bounded by a finite constant [Balcan, Hanneke, and Vaughan, 2010, Friedman, 2009, Hanneke, 2007a, 2011, Mahalanabis, 2011], while in other cases, $\theta(\varepsilon)$ may have an interesting dependence on $\varepsilon$ [Balcan, Hanneke, and Vaughan, 2010, Raginsky and Rakhlin, 2011, Wang, 2011]. The reader is referred to the works of Hanneke [2011, 2012] for detailed discussions on the disagreement coefficient.
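To make Definition 12.13 concrete, here is a small sketch for a toy setting that is not taken from the text: threshold classifiers $h_t(x) = \mathrm{sign}(x - t)$ with $t \in [0, 1]$, a uniform marginal on $[0, 1]$, and the ball $\mathrm{B}(h_{t^*}, r)$ taken to be the thresholds within distance $r$ of $t^*$ (an assumption about how distance is measured). In this setting the region of disagreement of the ball is the interval $(t^* - r, t^* + r)$ clipped to $[0, 1]$. The grid search and all function names are hypothetical.

```python
def dis_mass_threshold(t_star: float, r: float) -> float:
    """Mass of DIS(B(h*, r)) for thresholds on [0, 1] under the uniform distribution."""
    return min(t_star + r, 1.0) - max(t_star - r, 0.0)


def disagreement_coefficient(t_star: float, r0: float, grid: int = 10_000) -> float:
    """Approximate sup_{r > r0} P(DIS(B(h*, r))) / r, floored at 1, over a grid of radii in (r0, 1].

    Radii above 1 are skipped because the ratio is then at most 1, which the floor already covers.
    """
    best = 1.0
    for i in range(1, grid + 1):
        r = r0 + (1.0 - r0) * i / grid
        best = max(best, dis_mass_threshold(t_star, r) / r)
    return best


theta_interior = disagreement_coefficient(t_star=0.5, r0=0.01)  # about 2
theta_boundary = disagreement_coefficient(t_star=0.0, r0=0.01)  # about 1
```

For an interior threshold the ratio equals 2 for all small radii, which is the familiar constant bound for this class; for a threshold at the boundary only one side can disagree, and the coefficient drops to 1.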
12.5.3  Specification of $\mathring{\phi}_\ell$


Next, we recall a few well-known bounds on the $\phi_\ell$ function, which lead to a more concrete instance of a function $\mathring{\phi}_\ell$ satisfying Definition 12.5. Below, we let $\mathcal{G}^*$ denote the set of measurable functions $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Also, for $\mathcal{G} \subseteq \mathcal{G}^*$, let $\mathrm{F}(\mathcal{G}) = \sup_{g \in \mathcal{G}} |g|$ denote the minimal envelope function for $\mathcal{G}$, and for $g \in \mathcal{G}^*$ let $\|g\|_P^2 = \int g^2 \, dP$ denote the squared $L_2(P)$ seminorm of $g$; we will generally assume $\mathrm{F}(\mathcal{G})$ is measurable in the discussion below.


Uniform Entropy. The first bound is based on the work of van der Vaart and Wellner [2011]; related bounds have been studied by Giné and Koltchinskii [2006], Giné, Koltchinskii, and Wellner [2003], van der Vaart and Wellner [1996], and others. For a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, a set $\mathcal{G} \subseteq \mathcal{G}^*$, and $\varepsilon > 0$, let $\mathcal{N}(\varepsilon, \mathcal{G}, L_2(P))$ denote the size of a minimal $\varepsilon$-cover of $\mathcal{G}$ (that is, the minimum number of balls of radius at most $\varepsilon$ sufficient to cover $\mathcal{G}$), where distances are measured in terms of the $L_2(P)$ pseudo-metric: $(f, g) \mapsto \|f - g\|_P$. For $\sigma > 0$ and $\mathrm{F} \in \mathcal{G}^*$, define the function

$$J(\sigma, \mathcal{G}, \mathrm{F}) = \sup_{Q} \int_0^{\sigma} \sqrt{1 + \ln \mathcal{N}(\varepsilon \|\mathrm{F}\|_Q, \mathcal{G}, L_2(Q))} \, d\varepsilon,$$

where $Q$ ranges over all finitely discrete probability measures.

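Fix any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ and any $\mathcal{H} \subseteq [\mathcal{F}]$ with $h^*_P \in \mathcal{H}$, and let

$$\mathcal{G}_{\mathcal{H}} = \{(x, y) \mapsto \ell(h(x) y) : h \in \mathcal{H}\},$$

$$\text{and } \mathcal{G}_{\mathcal{H},P} = \{(x, y) \mapsto \ell(h(x) y) - \ell(h^*_P(x) y) : h \in \mathcal{H}\}. \tag{12.17}$$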

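Then, since $J(\sigma, \mathcal{G}_{\mathcal{H}}, \mathrm{F}) = J(\sigma, \mathcal{G}_{\mathcal{H},P}, \mathrm{F})$, it follows from Theorem 2.1 of van der Vaart and Wellner [2011] (and a triangle inequality) that for some universal constant $c \in [1, \infty)$, for any $m \in \mathbb{N}$, $\mathrm{F} \ge \mathrm{F}(\mathcal{G}_{\mathcal{H},P})$, and $\sigma \ge D_\ell(\mathcal{H}; P)$,

$$\phi_\ell(\mathcal{H}; P, m) \le c \, J\!\left( \frac{\sigma}{\|\mathrm{F}\|_P}, \mathcal{G}_{\mathcal{H}}, \mathrm{F} \right) \|\mathrm{F}\|_P \left( \frac{1}{\sqrt{m}} + \frac{J\!\left( \frac{\sigma}{\|\mathrm{F}\|_P}, \mathcal{G}_{\mathcal{H}}, \mathrm{F} \right) \|\mathrm{F}\|_P \, \bar{\ell}}{\sigma^2 m} \right). \tag{12.18}$$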
Based on (12.18), it is straightforward to define a function $\mathring{\phi}_\ell$ that satisfies Definition 12.5. Specifically, define

$$\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P) = \inf_{\mathrm{F} \ge \mathrm{F}(\mathcal{G}_{\mathcal{H},P})} \inf_{\lambda \ge \sigma} c \, J\!\left( \frac{\lambda}{\|\mathrm{F}\|_P}, \mathcal{G}_{\mathcal{H}}, \mathrm{F} \right) \|\mathrm{F}\|_P \left( \frac{1}{\sqrt{m}} + \frac{J\!\left( \frac{\lambda}{\|\mathrm{F}\|_P}, \mathcal{G}_{\mathcal{H}}, \mathrm{F} \right) \|\mathrm{F}\|_P \, \bar{\ell}}{\lambda^2 m} \right), \tag{12.19}$$


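for $c$ as in (12.18). By (12.18), $\mathring{\phi}^{(1)}_\ell$ satisfies (12.5). Also note that $m \mapsto \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ is nonincreasing, while $\sigma \mapsto \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ is nondecreasing. Furthermore, $\mathcal{H} \mapsto \mathcal{N}(\varepsilon, \mathcal{G}_{\mathcal{H}}, L_2(Q))$ is nondecreasing for all $Q$, so that $\mathcal{H} \mapsto J(\sigma, \mathcal{G}_{\mathcal{H}}, \mathrm{F})$ is nondecreasing as well; since $\mathcal{H} \mapsto \mathrm{F}(\mathcal{G}_{\mathcal{H},P})$ is also nondecreasing, we see that $\mathcal{H} \mapsto \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ is nondecreasing. Similarly, for $\mathcal{U} \subseteq \mathcal{X}$, $\mathcal{N}(\varepsilon, \mathcal{G}_{\mathcal{H}_{\mathcal{U},h^*_P}}, L_2(Q)) \le \mathcal{N}(\varepsilon, \mathcal{G}_{\mathcal{H}}, L_2(Q))$ for all $Q$, so that $J(\sigma, \mathcal{G}_{\mathcal{H}_{\mathcal{U},h^*_P}}, \mathrm{F}) \le J(\sigma, \mathcal{G}_{\mathcal{H}}, \mathrm{F})$; because $\mathrm{F}(\mathcal{G}_{\mathcal{H}_{\mathcal{U},h^*_P},P}) \le \mathrm{F}(\mathcal{G}_{\mathcal{H},P})$, we have $\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}_{\mathcal{U},h^*_P}; m, P) \le \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ as well. Thus, to satisfy Definition 12.5, it suffices to take $\mathring{\phi}_\ell = \mathring{\phi}^{(1)}_\ell$.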


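Bracketing Entropy. Our second bound is a classic result in empirical process theory. For functions $g_1 \le g_2$, a bracket $[g_1, g_2]$ is the set of functions $g \in \mathcal{G}^*$ with $g_1 \le g \le g_2$; $[g_1, g_2]$ is called an $\varepsilon$-bracket under $L_2(P)$ if $\|g_1 - g_2\|_P < \varepsilon$. Then $\mathcal{N}_{[]}(\varepsilon, \mathcal{G}, L_2(P))$ denotes the smallest number of $\varepsilon$-brackets (under $L_2(P)$) sufficient to cover $\mathcal{G}$. For $\sigma > 0$, define the function

$$J_{[]}(\sigma, \mathcal{G}, P) = \int_0^{\sigma} \sqrt{1 + \ln \mathcal{N}_{[]}(\varepsilon, \mathcal{G}, L_2(P))} \, d\varepsilon.$$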

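Fix any $\mathcal{H} \subseteq [\mathcal{F}]$, and let $\mathcal{G}_{\mathcal{H}}$ and $\mathcal{G}_{\mathcal{H},P}$ be as above. Then since $J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}}, P) = J_{[]}(\sigma, \mathcal{G}_{\mathcal{H},P}, P)$, Lemma 3.4.2 of van der Vaart and Wellner [1996] and a triangle inequality imply that for some universal constant $c \in [1, \infty)$, for any $m \in \mathbb{N}$ and $\sigma \ge D_\ell(\mathcal{H}; P)$,

$$\phi_\ell(\mathcal{H}; P, m) \le c \, J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}}, P) \left( \frac{1}{\sqrt{m}} + \frac{J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}}, P) \, \bar{\ell}}{\sigma^2 m} \right). \tag{12.20}$$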

As-is, the right side of (12.20) nearly satisfies Definition 12.5 already. Only a slight modification is required to fulfill the requirement of monotonicity in $\sigma$. Specifically, define

$$\mathring{\phi}^{(2)}_\ell(\sigma, \mathcal{H}; m, P) = \inf_{\lambda \ge \sigma} c \, J_{[]}(\lambda, \mathcal{G}_{\mathcal{H}}, P) \left( \frac{1}{\sqrt{m}} + \frac{J_{[]}(\lambda, \mathcal{G}_{\mathcal{H}}, P) \, \bar{\ell}}{\lambda^2 m} \right), \tag{12.21}$$

for $c$ as in (12.20). Then taking $\mathring{\phi}_\ell = \mathring{\phi}^{(2)}_\ell$ suffices to satisfy Definition 12.5.

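Since Definition 12.5 is satisfied for both $\mathring{\phi}^{(1)}_\ell$ and $\mathring{\phi}^{(2)}_\ell$, it is also satisfied for

$$\mathring{\phi}_\ell = \min\left\{ \mathring{\phi}^{(1)}_\ell, \mathring{\phi}^{(2)}_\ell \right\}. \tag{12.22}$$

For the remainder of this section, we suppose $\mathring{\phi}_\ell$ is defined as in (12.22) (for all distributions $P$ over $\mathcal{X} \times \mathcal{Y}$), and study the implications arising from the combination of this definition with the abstract theorems above.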

12.5.4  VC  Subgraph  Classes 

For a collection $\mathcal{A}$ of sets, a set $\{z_1, \ldots, z_k\}$ of points is said to be shattered by $\mathcal{A}$ if $|\{A \cap \{z_1, \ldots, z_k\} : A \in \mathcal{A}\}| = 2^k$. The VC dimension $\mathrm{vc}(\mathcal{A})$ of $\mathcal{A}$ is then defined as the largest integer $k$ for which there exist $k$ points $\{z_1, \ldots, z_k\}$ shattered by $\mathcal{A}$ [Vapnik and Chervonenkis, 1971]; if no such largest $k$ exists, we define $\mathrm{vc}(\mathcal{A}) = \infty$. For a set $\mathcal{G}$ of real-valued functions, denote by $\mathrm{vc}(\mathcal{G})$ the VC dimension of the collection $\{\{(x, y) : y < g(x)\} : g \in \mathcal{G}\}$ of subgraphs of functions in $\mathcal{G}$ (called the pseudo-dimension [Haussler, 1992, Pollard, 1990]); to simplify the statement of results below, we adopt the convention that when the VC dimension of this collection is $0$, we let $\mathrm{vc}(\mathcal{G}) = 1$. A set $\mathcal{G}$ is said to be a VC subgraph class if $\mathrm{vc}(\mathcal{G}) < \infty$ [van der Vaart and Wellner, 1996].
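The shattering definition can be checked by brute force on small finite examples. The sketch below is only an illustration of the definition; the function names and the two toy collections (initial segments and contiguous intervals on a four-point ground set) are assumptions introduced here. It returns the raw VC dimension and does not apply the convention that replaces $0$ by $1$.

```python
from itertools import combinations


def is_shattered(points, collection) -> bool:
    """True if the collection induces all 2^k subsets of the given k points."""
    pts = frozenset(points)
    traces = {pts & frozenset(A) for A in collection}
    return len(traces) == 2 ** len(pts)


def vc_dimension(ground_set, collection) -> int:
    """Largest k such that some k points of ground_set are shattered (0 if none)."""
    ground = list(ground_set)
    best = 0
    for k in range(1, len(ground) + 1):
        if any(is_shattered(c, collection) for c in combinations(ground, k)):
            best = k
        else:
            break  # subsets of a shattered set are shattered, so larger k cannot succeed
    return best


ground = [1, 2, 3, 4]
# Initial segments {1, ..., t}, including the empty set.
segments = [set(range(1, t + 1)) for t in range(0, 5)]
# Contiguous intervals {i, ..., j-1}, including empty sets.
intervals = [set(range(i, j)) for i in range(1, 6) for j in range(i, 6)]

vc_segments = vc_dimension(ground, segments)    # 1
vc_intervals = vc_dimension(ground, intervals)  # 2
```

The search is exponential in the size of the ground set, so it is meant only for checking intuition on tiny examples.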

Because  we  are  interested  in  results  concerning  values  of  R ((h)  —  RR/C),  for  functions  h 
in certain subsets $\mathcal{H} \subseteq [\mathcal{F}]$, we will formulate results below in terms of $\mathrm{vc}(\mathcal{G}_{\mathcal{H}})$, for $\mathcal{G}_{\mathcal{H}}$ defined as above. Depending on certain properties of $\ell$, these results can often be restated directly in terms of $\mathrm{vc}(\mathcal{H})$; for instance, this is true when $\ell$ is monotone, since $\mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \le \mathrm{vc}(\mathcal{H})$ in that case [Dudley, 1987, Haussler, 1992, Nolan and Pollard, 1987].

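The following is a well-known result for VC subgraph classes [see e.g., van der Vaart and Wellner, 1996], derived from the works of Pollard [1984] and Haussler [1992].

Lemma 12.14. For any $\mathcal{G} \subseteq \mathcal{G}^*$, for any measurable $\mathrm{F} \ge \mathrm{F}(\mathcal{G})$, for any distribution $Q$ such that $\|\mathrm{F}\|_Q > 0$, for any $\varepsilon \in (0, 1)$,

$$\mathcal{N}(\varepsilon \|\mathrm{F}\|_Q, \mathcal{G}, L_2(Q)) \le A(\mathcal{G}) \left( \frac{1}{\varepsilon} \right)^{2 \mathrm{vc}(\mathcal{G})},$$

where $A(\mathcal{G}) \lesssim (\mathrm{vc}(\mathcal{G}) + 1)(16 e)^{\mathrm{vc}(\mathcal{G})}$.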

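In particular, Lemma 12.14 implies that any $\mathcal{G} \subseteq \mathcal{G}^*$ has, $\forall \sigma \in (0, 1]$,

$$J(\sigma, \mathcal{G}, \mathrm{F}) \le \int_0^{\sigma} \sqrt{\ln(e A(\mathcal{G})) + 2 \mathrm{vc}(\mathcal{G}) \ln(1/\varepsilon)} \, d\varepsilon \tag{12.23}$$

$$\le 2\sigma \sqrt{\ln(e A(\mathcal{G}))} + \sqrt{8 \mathrm{vc}(\mathcal{G})} \int_0^{\sigma} \sqrt{\ln(1/\varepsilon)} \, d\varepsilon$$

$$= 2\sigma \sqrt{\ln(e A(\mathcal{G}))} + \sigma \sqrt{8 \mathrm{vc}(\mathcal{G}) \ln(1/\sigma)} + \sqrt{2\pi \mathrm{vc}(\mathcal{G})} \, \mathrm{erfc}\left( \sqrt{\ln(1/\sigma)} \right).$$

Since $\mathrm{erfc}(x) \le \exp\{-x^2\}$ for all $x \ge 0$, (12.23) implies $\forall \sigma \in (0, 1]$,

$$J(\sigma, \mathcal{G}, \mathrm{F}) \lesssim \sigma \sqrt{\mathrm{vc}(\mathcal{G}) \mathrm{Log}(1/\sigma)}. \tag{12.24}$$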
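The last equality in the chain above rests on the closed form $\int_0^{\sigma} \sqrt{\ln(1/\varepsilon)} \, d\varepsilon = \sigma \sqrt{\ln(1/\sigma)} + \frac{\sqrt{\pi}}{2} \mathrm{erfc}(\sqrt{\ln(1/\sigma)})$, and the final relaxation uses $\mathrm{erfc}(x) \le e^{-x^2}$. The following sketch checks both numerically; the substitution $\varepsilon = e^{-u^2}$, the midpoint rule, the truncation width, and the function names are choices made here for illustration only.

```python
import math


def closed_form(sigma: float) -> float:
    """sigma * sqrt(ln(1/sigma)) + (sqrt(pi) / 2) * erfc(sqrt(ln(1/sigma)))."""
    u0 = math.sqrt(math.log(1.0 / sigma))
    return sigma * u0 + 0.5 * math.sqrt(math.pi) * math.erfc(u0)


def numeric(sigma: float, steps: int = 200_000, span: float = 10.0) -> float:
    """Midpoint-rule estimate of the integral of sqrt(ln(1/eps)) over (0, sigma).

    With eps = exp(-u^2) the integral becomes the integral of 2 u^2 exp(-u^2)
    over u >= u0; the tail beyond u0 + span is negligible.
    """
    u0 = math.sqrt(math.log(1.0 / sigma))
    h = span / steps
    total = 0.0
    for i in range(steps):
        u = u0 + (i + 0.5) * h
        total += 2.0 * u * u * math.exp(-u * u)
    return total * h


# Largest gap between the closed form and the numerical estimate on a few radii.
gaps = [abs(closed_form(s) - numeric(s)) for s in (1.0, 0.5, 0.1)]
# Check erfc(x) <= exp(-x^2) on a grid of nonnegative x.
erfc_ok = all(math.erfc(0.01 * k) <= math.exp(-(0.01 * k) ** 2) for k in range(0, 500))
```

Multiplying the closed form by $\sqrt{8 \mathrm{vc}(\mathcal{G})}$ recovers the last two terms of the final line of the chain.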

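Applying these observations to bound $J(\sigma, \mathcal{G}_{\mathcal{H},P}, \mathrm{F})$ for $\mathcal{H} \subseteq [\mathcal{F}]$ and $\mathrm{F} \ge \mathrm{F}(\mathcal{G}_{\mathcal{H},P})$, noting $J(\sigma, \mathcal{G}_{\mathcal{H}}, \mathrm{F}) = J(\sigma, \mathcal{G}_{\mathcal{H},P}, \mathrm{F})$ and $\mathrm{vc}(\mathcal{G}_{\mathcal{H},P}) = \mathrm{vc}(\mathcal{G}_{\mathcal{H}})$, and plugging the resulting bound into (12.19) yields the following well-known bound on $\mathring{\phi}^{(1)}_\ell$ due to Giné and Koltchinskii [2006]. For any $m \in \mathbb{N}$ and $\sigma > 0$,

$$\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P) \lesssim \inf_{\lambda \ge \sigma} \lambda \sqrt{ \frac{\mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \mathrm{Log}\left( \frac{\|\mathrm{F}(\mathcal{G}_{\mathcal{H},P})\|_P}{\lambda} \right)}{m} } + \frac{\mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \, \bar{\ell} \, \mathrm{Log}\left( \frac{\|\mathrm{F}(\mathcal{G}_{\mathcal{H},P})\|_P}{\lambda} \right)}{m}. \tag{12.25}$$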

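Specifically, to arrive at (12.25), we relaxed the $\inf_{\mathrm{F} \ge \mathrm{F}(\mathcal{G}_{\mathcal{H},P})}$ in (12.19) by taking $\mathrm{F} \ge \mathrm{F}(\mathcal{G}_{\mathcal{H},P})$ such that $\|\mathrm{F}\|_P = \max\{\sigma, \|\mathrm{F}(\mathcal{G}_{\mathcal{H},P})\|_P\}$, thus maintaining $\lambda / \|\mathrm{F}\|_P \in (0, 1]$ for the minimizing $\lambda$ value, so that (12.24) remains valid; we also made use of the fact that $\mathrm{Log} \ge 1$, which gives us $\mathrm{Log}(\|\mathrm{F}\|_P / \lambda) = \mathrm{Log}(\|\mathrm{F}(\mathcal{G}_{\mathcal{H},P})\|_P / \lambda)$ for this case.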

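In particular, (12.25) implies

$$\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P) \lesssim \inf_{\sigma \ge D_\ell([\mathcal{H}](\gamma_2; \ell, P); P)} \left( \frac{\sigma^2}{\gamma_1^2} + \frac{\bar{\ell}}{\gamma_1} \right) \mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \mathrm{Log}\left( \frac{\|\mathrm{F}(\mathcal{G}_{\mathcal{H},P})\|_P}{\sigma} \right). \tag{12.26}$$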
Following Giné and Koltchinskii [2006], for $r > 0$, define $\mathrm{B}_{\mathcal{H},P}(h^*_P, r; \ell) = \{g \in \mathcal{H} : D_\ell(g, h^*_P; P)^2 \le r\}$, and for $r_0 > 0$, define

$$\tau_\ell(r_0; \mathcal{H}, P) = \sup_{r > r_0} \frac{\left\| \mathrm{F}\left( \mathcal{G}_{\mathrm{B}_{\mathcal{H},P}(h^*_P, r; \ell), P} \right) \right\|_P^2}{r} \vee 1.$$

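When $P = \mathcal{P}_{XY}$, abbreviate this as $\tau_\ell(r_0; \mathcal{H}) = \tau_\ell(r_0; \mathcal{H}, \mathcal{P}_{XY})$, and when $\mathcal{H} = \mathcal{F}$, further abbreviate $\tau_\ell(r_0) = \tau_\ell(r_0; \mathcal{F}, \mathcal{P}_{XY})$. For $\lambda > 0$, when $h^*_P \in \mathcal{H}$ and $P$ satisfies Condition 12.11, (12.26) implies that

$$\sup_{\gamma \ge \lambda} \mathring{M}_\ell\left( \gamma/(4K), \gamma; \mathcal{H}(\gamma; \ell, P), P \right) \lesssim \left( \frac{b}{\lambda^{2-\beta}} + \frac{\bar{\ell}}{\lambda} \right) \mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \mathrm{Log}\left( \tau_\ell\left( b \lambda^{\beta}; \mathcal{H}, P \right) \right). \tag{12.27}$$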

Combining this observation with (12.6), (12.8), (12.9), (12.10), and Theorem 12.6, we arrive at a result for the sample complexity of empirical $\ell$-risk minimization with a general VC subgraph class under Conditions 12.10 and 12.11. Specifically, for $s : (0, \infty)^2 \to [1, \infty)$, when $h^* \in \mathcal{F}$, (12.6) implies that

Me(Te(£y,P. ,VXy,s)  <  M^(r^(e); P ,VX y,s) 

=  sup  Mfiy/2,r,P(r,£),VXY,s(Te(£),y)) 

7>r£(e) 

<  sup  M^/2,TP(rJ),VXY,$(Ti(s)n)).  (12.28) 

7>r£(e) 

Supposing  Vxy  satisfies  Conditions  12.10  and  12.11,  applying  (12.8),  (12.9),  and  (12.27)  to 
(12.28),  and  taking  s(A,  7)  =  Log  (^/) ,  we  arrive  at  the  following  theorem,  which  is  implicit  in 
the  work  of  Gine  and  Koltchinskii  [2006]. 

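Theorem 12.15. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10 and Condition 12.11, $\ell$ is classification-calibrated, $h^* \in \mathcal{F}$, and $\Psi_\ell$ is as in (12.15), then for any $\varepsilon \in (0, 1)$, letting $\tau_\ell = \tau_\ell\left( b \Psi_\ell(\varepsilon)^{\beta} \right)$, for any $m \in \mathbb{N}$ with

$$m \ge c \left( \frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right) \left( \mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \mathrm{Log}(\tau_\ell) + \mathrm{Log}(1/\delta) \right), \tag{12.29}$$

with probability at least $1 - \delta$, $\mathrm{ERM}_\ell(\mathcal{F}, \mathcal{Z}_m)$ produces $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.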
As noted by Giné and Koltchinskii [2006], in the special case when $\ell$ is itself the 0-1 loss, the bound in Theorem 12.15 simplifies quite nicely, since in that case $\left\| \mathrm{F}\left( \mathcal{G}_{\mathrm{B}_{\mathcal{F},\mathcal{P}_{XY}}(h^*, r; \ell), \mathcal{P}_{XY}} \right) \right\|_{\mathcal{P}_{XY}}^2 = \mathcal{P}(\mathrm{DIS}(\mathrm{B}(h^*, r)))$, so that $\tau_\ell(r_0) = \theta(r_0)$; in this case, we also have $\mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \le \mathrm{vc}(\mathcal{F})$ and $\Psi_\ell(\varepsilon) = \varepsilon/2$, and we can take $\beta = \alpha$ and $b = a$, so that it suffices to have

$$m \ge c \, a \, \varepsilon^{\alpha - 2} \left( \mathrm{vc}(\mathcal{F}) \mathrm{Log}(\theta) + \mathrm{Log}(1/\delta) \right), \tag{12.30}$$

where $\theta = \theta(a \varepsilon^{\alpha})$ and $c \in [1, \infty)$ is a universal constant. It is known that this is sometimes the minimax optimal number of samples sufficient for passive learning [Castro and Nowak, 2008, Hanneke, 2011, Raginsky and Rakhlin, 2011].
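To get a feel for how the right side of (12.30) scales, the following sketch evaluates it for given parameter values. The universal constant $c$ is not specified in the text, so it is left as a parameter with a placeholder default of 1; the definition $\mathrm{Log}(x) = \max\{\ln x, 1\}$ is assumed here, consistent with the fact $\mathrm{Log} \ge 1$ used above; and the function names are hypothetical.

```python
import math


def Log(x: float) -> float:
    """Truncated logarithm, assumed here to be max{ln(x), 1}."""
    return max(math.log(x), 1.0)


def passive_sample_bound(eps: float, delta: float, vc_dim: float, theta: float,
                         a: float = 1.0, alpha: float = 1.0, c: float = 1.0) -> float:
    """Right side of (12.30): c * a * eps^(alpha - 2) * (vc * Log(theta) + Log(1 / delta)).

    c is an unspecified universal constant; the default of 1 is a placeholder only.
    """
    return c * a * eps ** (alpha - 2.0) * (vc_dim * Log(theta) + Log(1.0 / delta))


# alpha = 1 (low noise) gives a 1/eps rate, alpha = 0 gives a 1/eps^2 rate.
fast = passive_sample_bound(eps=0.1, delta=0.05, vc_dim=10, theta=2.0, alpha=1.0)
slow = passive_sample_bound(eps=0.1, delta=0.05, vc_dim=10, theta=2.0, alpha=0.0)
```

Because $c$ is unknown, only ratios between such values are meaningful; here the two settings differ by exactly the factor $\varepsilon^{-1} = 10$.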

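Next, we turn to the performance of Algorithm 1 under the conditions of Theorem 12.15. Specifically, suppose $\mathcal{P}_{XY}$ satisfies Conditions 12.10 and 12.11, and for $\gamma_0 > 0$, define

$$\chi_\ell(\gamma_0) = \sup_{\gamma > \gamma_0} \frac{\mathcal{P}\left( \mathrm{DIS}\left( \mathrm{B}\left( h^*, a \mathcal{E}_\ell(\gamma)^{\alpha} \right) \right) \right)}{b \gamma^{\beta}} \vee 1.$$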

Note  that  \\F{GTj,vxY)\\vXY  -  ^  (DIS  ^  5 01)))-  Also>  note  that  vc (QTj)  < 

vc(^(£,(22 -i);01))  <  vc (gT).  Thus,  by  (12.26),  for  jt  <j<  [log2(l/'Ff(£))l’ 

M,(2  22-F  T^Vxy)  <  (b2j^  +  M)  vc(^)Log  (Xe  (^))  £)  •  (12.31) 


With a little additional work to define an appropriate $\hat{s}$ function and derive closed-form bounds on the summation in Theorem 12.7, we arrive at the following theorem regarding the performance of Algorithm 1 for VC subgraph classes. For completeness, the remaining technical details of the proof are included in Appendix 12.6.


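Theorem 12.16. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10 and Condition 12.11, $\ell$ is classification-calibrated, $h^* \in \mathcal{F}$, and $\Psi_\ell$ is as in (12.15), for any $\varepsilon \in (0, 1)$, letting $\theta = \theta(a \varepsilon^{\alpha})$, $\chi_\ell = \chi_\ell(\Psi_\ell(\varepsilon))$, $A_1 = \mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \mathrm{Log}(\chi_\ell \bar{\ell}) + \mathrm{Log}(1/\delta)$, $B_1 = \min\left\{ \frac{1}{1 - 2^{(\alpha + \beta - 2)}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, and $C_1 = \min\left\{ \frac{1}{1 - 2^{(\alpha - 1)}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, if

$$u \ge c \left( \frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right) A_1 \tag{12.32}$$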
and


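$$n \ge c \, \theta a \varepsilon^{\alpha} \left( \frac{b (A_1 + \mathrm{Log}(B_1)) B_1}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell} (A_1 + \mathrm{Log}(C_1)) C_1}{\Psi_\ell(\varepsilon)} \right) \tag{12.33}$$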


then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function, Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1 - \delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.

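To be clear, in specifying $B_1$ and $C_1$, we have adopted the convention that $1/0 = \infty$ and $\min\{\infty, x\} = x$ for any $x \in \mathbb{R}$, so that $B_1$ and $C_1$ are well-defined even when $\alpha = \beta = 1$, or $\alpha = 1$, respectively. Note that, when $\alpha + \beta < 2$, $B_1 = O(1)$, so that the asymptotic dependence on $\varepsilon$ in (12.33) is $O\left( \theta \varepsilon^{\alpha} \Psi_\ell(\varepsilon)^{\beta - 2} \mathrm{Log}(\chi_\ell) \right)$, while in the case of $\alpha = \beta = 1$, it is $O\left( \theta \mathrm{Log}(1/\varepsilon) (\mathrm{Log}(\theta) + \mathrm{Log}(\mathrm{Log}(1/\varepsilon))) \right)$. It is likely that the logarithmic and constant factors can be improved in many cases (particularly the $\mathrm{Log}(\chi_\ell \bar{\ell})$, $B_1$, and $C_1$ factors).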

Comparing the result in Theorem 12.16 to Theorem 12.15, we see that the condition on $u$ in (12.32) is almost identical to the condition on $m$ in (12.29), aside from a change in the logarithmic factor, so that the total number of data points needed is roughly the same. However, the number of labels indicated by (12.33) may often be significantly smaller than the condition in (12.29), reducing it by a factor of roughly $\theta a \varepsilon^{\alpha}$. This reduction is particularly strong when $\theta$ is bounded by a finite constant. Moreover, this is the same type of improvement that is known to occur when $\ell$ is itself the 0-1 loss [Hanneke, 2011], so that in particular these results agree with the existing analysis in this special case, and are therefore sometimes nearly minimax [Hanneke, 2011, Raginsky and Rakhlin, 2011]. Regarding the slight difference between (12.32) and (12.29) from replacing $\tau_\ell$ by $\chi_\ell \bar{\ell}$, the effect is somewhat mixed, and which of these is smaller may depend on the particular class $\mathcal{F}$ and loss $\ell$; we can generally bound $\chi_\ell$ as a function of $\theta(a \varepsilon^{\alpha})$, $\bar{\ell}$, $a$, $\alpha$, $b$, and $\beta$. In the special case of $\ell$ equal to the 0-1 loss, both $\tau_\ell$ and $\chi_\ell \bar{\ell}$ are equal to $\theta(a (\varepsilon/2)^{\alpha})$.

We note that the values $\hat{s}(m)$ used in the proof of Theorem 12.16 have a direct dependence on the parameters $b$, $\beta$, $a$, $\alpha$, and $\chi_\ell$. Such a dependence may be undesirable for many applications, where information about these values is not available. However, one can easily follow this same proof, taking $\hat{s}(m) = \mathrm{Log}\left( \frac{12 \log_2(2m)^2}{\delta} \right)$ instead, which only leads to an increase by a log log factor: specifically, replacing the factor of $A_1$ in (12.32), and the factors $(A_1 + \mathrm{Log}(B_1))$ and $(A_1 + \mathrm{Log}(C_1))$ in (12.33), with a factor of $(A_1 + \mathrm{Log}(\mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon))))$. It is not clear whether it is always possible to achieve the slightly tighter result of Theorem 12.16 without having direct access to the values $b$, $\beta$, $a$, $\alpha$, and $\chi_\ell$ in the algorithm.

As mentioned above, though convenient in the sense that it offers a completely abstract and unified approach, the choice of $\hat{T}_\ell(V; Q, m)$ given by (12.11) may often make Algorithm 1 computationally inefficient. However, for each of the applications studied here, we can relax this $\hat{T}_\ell$ function to a computationally-accessible value, which will then allow the algorithm to be efficient under convexity conditions on the loss and class of functions. In particular, in the present application to VC subgraph classes, Theorem 12.16 remains valid if we instead define $\hat{T}_\ell$ as follows. If we let $V^{(m)}$ and $Q_m$ denote the sets $V$ and $Q$ upon reaching Step 5 for any given value of $m$ with $\log_2(m) \in \mathbb{N}$ realized in Algorithm 1, then consider defining $\hat{T}_\ell$ in Step 6 inductively
by  letting  %l/2  =  b(IQ^2|vl)  (fe( V(m/2);  Qm/2,  m/2 )  A  tj  (or  7m/2  =  £  if  m  =  2),  and  taking 
(with  a  slight  abuse  of  notation  to  allow  7)  to  depend  on  sets  V{ml)  and  Qmi  with  ml  <  in) 


Te(vW-Qm,m)  = 


c0 


m/2 

\Qm\  VI 


\ 


7m/ 2—  VC  (Gt)  Log 


m 


+  —  Vc((yjr)Log 


m 


t(\Qm\  +s(m )) 
mh^m/2 

l(_|Qm|  +s(m)) 

mhlL/2 


sim) 


+  s/m) 


(12.34) 


/ 


for  an  appropriate  universal  constant  cq.  This  value  is  essentially  derived  by  upper  bounding 
m0iUe(Vr>is(v)','PxY,m/2,s(m))  (which  is  a  bound  on  (12.11)  by  Lemma  12.4),  based  on 
(12.25)  and  Condition  12.11  (along  with  a  Chernoff  bound  to  argue  \Qm\  ~  7,(DIS(C))m/2); 
since the sample sizes derived for $u$ and $n$ in Theorem 12.16 are based on these relaxations anyway, they remain sufficient (with slight changes to the constant factors) for these relaxed $\hat{T}_\ell$ values. For brevity, we defer a more detailed proof that these values of $\hat{T}_\ell$ suffice to achieve Theorem 12.16 to Appendix 12.7. Note that we have introduced a dependence on $b$ and $\beta$ in (12.34). These values would indeed be available for some applications, such as when they are derived from Lemma 12.12 when Condition 12.3 is satisfied; however, in other cases, there may be more-favorable values of $b$ and $\beta$ than given by Lemma 12.12, dependent on the specific $\mathcal{P}_{XY}$ distribution, and in these cases direct observation of these values might not be available. Thus, there remains an interesting open question of whether there exists a function $\hat{T}_\ell(V; Q, m)$, which is efficiently computable (under convexity assumptions) and yet preserves the validity of Theorem 12.16; this same question applies to each of the results below as well.

In the special case when $\ell$ satisfies Condition 12.3, we can derive a sometimes-stronger result via Corollary 12.9. Specifically, we can combine (12.26), (12.8), (12.9), and Lemma 12.12, to get that if $h^* \in \mathcal{F}$ and Condition 12.3 is satisfied, then for $j \ge j_\ell$ in Corollary 12.9,

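$$\mathring{M}_\ell\left( \frac{2^{-j-7}}{\mathcal{P}(U_j)}, \frac{2^{2-j}}{\mathcal{P}(U_j)}; \mathcal{F}_j, \mathcal{P}_{U_j}, s \right) \lesssim \left( b \left( 2^j \mathcal{P}(U_j) \right)^{2-\beta} + \bar{\ell} \, 2^j \mathcal{P}(U_j) \right) \left( \mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \mathrm{Log}\left( \bar{\ell} \left( 2^j \mathcal{P}(U_j) \right)^{\beta} / b \right) + s \right), \tag{12.35}$$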

where $b$ and $\beta$ are as in Lemma 12.12. Plugging this into Corollary 12.9, with $\hat{s}$ defined analogous to that used in the proof of Theorem 12.16, and bounding the summation in the condition for $n$ in Corollary 12.9, we arrive at the following theorem. The details of the proof proceed along similar lines as the proof of Theorem 12.16, and a sketch of the remaining technical details is included in Appendix 12.6.


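Theorem 12.17. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10, $\ell$ is classification-calibrated and satisfies Condition 12.3, $h^* \in \mathcal{F}$, $\Psi_\ell$ is as in (12.15), and $b$ and $\beta$ are as in Lemma 12.12, then for any $\varepsilon \in (0, 1)$, letting $\theta = \theta(a \varepsilon^{\alpha})$, $A_2 = \mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \mathrm{Log}\left( (\bar{\ell}/b) \left( a \theta \varepsilon^{\alpha} / \Psi_\ell(\varepsilon) \right)^{\beta} \right) + \mathrm{Log}(1/\delta)$, $B_2 = \min\left\{ \frac{1}{1 - 2^{(\alpha - 1)(2 - \beta)}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, and $C_2 = \min\left\{ \frac{1}{1 - 2^{(\alpha - 1)}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, if

$$u \ge c \left( \frac{b (a \theta \varepsilon^{\alpha})^{1-\beta}}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right) A_2 \tag{12.36}$$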
and


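$$n \ge c \left( b (A_2 + \mathrm{Log}(B_2)) B_2 \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{2-\beta} + \bar{\ell} (A_2 + \mathrm{Log}(C_2)) C_2 \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right) \right) \tag{12.37}$$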


then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function, Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1 - \delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.

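Examining the asymptotic dependence on $\varepsilon$ in the above result, the sufficient number of unlabeled samples is $O\left( \frac{(\theta \varepsilon^{\alpha})^{1-\beta}}{\Psi_\ell(\varepsilon)^{2-\beta}} \mathrm{Log}\left( \frac{\theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right) \right)$, and the sufficient number of label requests is $O\left( \left( \frac{\theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{2-\beta} \mathrm{Log}\left( \frac{\theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right) \right)$ in the case that $\alpha < 1$, or $O\left( \theta^{2-\beta} \mathrm{Log}(1/\varepsilon) \mathrm{Log}\left( \theta^{\beta} \mathrm{Log}(1/\varepsilon) \right) \right)$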

in the case that $\alpha = 1$. This is noteworthy in the case $\alpha > 0$ and $r_\ell > 2$, for at least two reasons. First, the number of label requests indicated by this result can often be smaller than that indicated by Theorem 12.16, by a factor of roughly $O\left( (\theta \varepsilon^{\alpha})^{1-\beta} \right)$; this is particularly interesting when $\theta$ is bounded by a finite constant. The second interesting feature of this result is that even the sufficient number of unlabeled samples, as indicated by (12.36), can often be smaller than the number of labeled samples sufficient for $\mathrm{ERM}_\ell$, as indicated by Theorem 12.15, again by a factor of roughly $O\left( (\theta \varepsilon^{\alpha})^{1-\beta} \right)$. This indicates that, in the case of a surrogate loss $\ell$ satisfying Condition 12.3 with $r_\ell > 2$, when Theorem 12.15 is tight, even if we have complete access to a fully labeled data set, we may still prefer to use Algorithm 1 rather than $\mathrm{ERM}_\ell$; this is somewhat surprising, since (as (12.37) indicates) we expect Algorithm 1 to ignore the vast majority of the labels in this case. That said, it is not clear whether there exist natural classification-calibrated losses $\ell$ satisfying Condition 12.3 with $r_\ell > 2$ for which the indicated sufficient size of $m$ in Theorem 12.15 is ever competitive with the known results for methods that directly optimize the empirical 0-1 risk (i.e., Theorem 12.15 with $\ell$ the 0-1 loss); thus, the improvements in $u$ and $n$ reflected by Theorem 12.17 may simply indicate that Algorithm 1 is, to some extent, compensating for a choice of loss $\ell$ that would otherwise lead to suboptimal label complexities.

We note that, as in Theorem 12.16, the values $\hat{s}$ used to obtain this result have a direct dependence on certain values, which are typically not directly accessible in practice: in this case, $a$, $\alpha$, and $\theta$. However, as was the case for Theorem 12.16, we can obtain only slightly worse results by instead taking $\hat{s}(m) = \mathrm{Log}\left( \frac{12 \log_2(2m)^2}{\delta} \right)$, which again only leads to an increase by a log log factor: replacing the factor of $A_2$ in (12.36), and the factors $(A_2 + \mathrm{Log}(B_2))$ and $(A_2 + \mathrm{Log}(C_2))$ in (12.37), with a factor of $(A_2 + \mathrm{Log}(\mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon))))$. As before, it is not clear whether the slightly tighter result of Theorem 12.17 is always available, without requiring direct dependence on these quantities.

As was also true of Theorem 12.16, while the above choice of $\hat{T}_\ell(V; Q, m)$ given by (12.11) provides an elegant unifying perspective, it may often be infeasible to calculate efficiently. However, as was possible in that case, we can define an alternative that is specialized to the conditions of Theorem 12.17, for which the theorem statement remains valid. Specifically, consider instead defining $\hat{T}_\ell$ in Step 6 as


Tg(yW-Qm,m) 

= Co  (vc(fc)Log 0  (5^) "')  +i(m>))  (1238) 

for  b  and  /3  as  in  Lemma  12.12,  and  for  an  appropriate  universal  constant  c0.  This  value  is  essen¬ 
tially  derived  by  bounding  Ug(V ;  V-dis(v),  -s(m)),  which  is  informative  in  Step  6  via  Lemma  12.4. 
Since Theorem 12.17 is proven by considering concentration under the conditional distributions $\mathcal{P}_{U_j}$ via Corollary 12.9, and (12.38) represents the concentration bound one gets from directly applying Lemma 12.4 to the samples from the conditional distribution $\mathcal{P}_{\mathrm{DIS}(V^{(m)})}$, one can show that the conclusions of Theorem 12.17 remain valid for this specification of $\hat{T}_\ell$ in place of (12.11). For brevity, the details of the proof are omitted. Note that, unlike the analogous result for Theorem 12.16 based on (12.34) above, in this case all of the quantities in $\hat{T}_\ell(V; Q, m)$ are directly observable (in particular, $b$ and $\beta$), aside from any possible dependence arising in the specification of $\hat{s}$.
12.5.5  Entropy  Conditions


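Next we turn to problems satisfying certain entropy conditions. In particular, the following represent two commonly-studied conditions, which allow for concise statement of results below.

Condition 12.18. For some $q \ge 1$, $\rho \in (0, 1)$, and $\mathrm{F} \ge \mathrm{F}(\mathcal{G}_{\mathcal{F},\mathcal{P}_{XY}})$, either $\forall \varepsilon > 0$,

$$\ln \mathcal{N}_{[]}\left( \varepsilon \|\mathrm{F}\|_{\mathcal{P}_{XY}}, \mathcal{G}_{\mathcal{F}}, L_2(\mathcal{P}_{XY}) \right) \le q \varepsilon^{-2\rho}, \tag{12.39}$$

or for all finitely discrete $P$, $\forall \varepsilon > 0$,

$$\ln \mathcal{N}\left( \varepsilon \|\mathrm{F}\|_P, \mathcal{G}_{\mathcal{F}}, L_2(P) \right) \le q \varepsilon^{-2\rho}. \tag{12.40}$$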


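In particular, note that when $\mathcal{F}$ satisfies Condition 12.18, for $0 \le \sigma \le 2 \|\mathrm{F}\|_{\mathcal{P}_{XY}}$,

$$\mathring{\phi}_\ell(\sigma, \mathcal{F}; m, \mathcal{P}_{XY}) \lesssim \max\left\{ \frac{\sqrt{q} \, \|\mathrm{F}\|_{\mathcal{P}_{XY}}^{\rho} \sigma^{1-\rho}}{(1 - \rho) m^{1/2}}, \; \frac{\bar{\ell}^{\frac{1-\rho}{1+\rho}} q^{\frac{1}{1+\rho}} \|\mathrm{F}\|_{\mathcal{P}_{XY}}^{\frac{2\rho}{1+\rho}}}{(1 - \rho)^{\frac{2}{1+\rho}} m^{\frac{1}{1+\rho}}} \right\}. \tag{12.41}$$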


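Since $D_\ell([\mathcal{F}]) \le 2 \|\mathrm{F}\|_{\mathcal{P}_{XY}}$, this implies that for any numerical constant $c \in (0, 1]$, for every $\gamma \in (0, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.11, then

$$\mathring{M}_\ell(c\gamma, \gamma; \mathcal{F}, \mathcal{P}_{XY}) \lesssim \frac{q \|\mathrm{F}\|_{\mathcal{P}_{XY}}^{2\rho}}{(1 - \rho)^2} \max\left\{ b^{1-\rho} \gamma^{\beta(1-\rho) - 2}, \; \bar{\ell}^{1-\rho} \gamma^{-(1+\rho)} \right\}. \tag{12.42}$$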

Combined  with  (12.8),  (12.9),  (12.10),  and  Theorem  12.6,  taking  s(A,  7)  =  Log  (77),  we  arrive 
at  the  following  classic  result  [e.g.,  Bartlett,  Jordan,  and  McAuliffe,  2006,  van  der  Vaart  and 
Wellner,  1996]. 

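Theorem 12.19. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10 and Condition 12.11, $\mathcal{F}$ and $\mathcal{P}_{XY}$ satisfy Condition 12.18, $\ell$ is classification-calibrated, $h^* \in \mathcal{F}$, and $\Psi_\ell$ is as in (12.15), then for any $\varepsilon \in (0, 1)$ and $m$ with

$$m \ge c \, \frac{q \|\mathrm{F}\|_{\mathcal{P}_{XY}}^{2\rho}}{(1 - \rho)^2} \left( \frac{b^{1-\rho}}{\Psi_\ell(\varepsilon)^{2 - \beta(1-\rho)}} + \frac{\bar{\ell}^{1-\rho}}{\Psi_\ell(\varepsilon)^{1+\rho}} \right) + c \left( \frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right) \mathrm{Log}\left( \frac{1}{\delta} \right),$$

with probability at least $1 - \delta$, $\mathrm{ERM}_\ell(\mathcal{F}, \mathcal{Z}_m)$ produces $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.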
Next, turning to the analysis of Algorithm 1 under these same conditions, combining (12.42) with (12.8), (12.9), and Theorem 12.7, we have the following result. The details of the proof follow analogously to the proof of Theorem 12.16, and are therefore omitted for brevity.
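Theorem 12.20. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10 and Condition 12.11, $\mathcal{F}$ and $\mathcal{P}_{XY}$ satisfy Condition 12.18, $\ell$ is classification-calibrated, $h^* \in \mathcal{F}$, and $\Psi_\ell$ is as in (12.15), then for any $\varepsilon \in (0, 1)$, letting $B_1$ and $C_1$ be as in Theorem 12.16, $B_3 = \min\left\{ \frac{1}{1 - 2^{(\alpha + \beta(1-\rho) - 2)}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, $C_3 = \min\left\{ \frac{1}{1 - 2^{(\alpha - (1+\rho))}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, and $\theta = \theta(a \varepsilon^{\alpha})$, if

$$u \ge c \, \frac{q \|\mathrm{F}\|_{\mathcal{P}_{XY}}^{2\rho}}{(1 - \rho)^2} \left( \frac{b^{1-\rho}}{\Psi_\ell(\varepsilon)^{2 - \beta(1-\rho)}} + \frac{\bar{\ell}^{1-\rho}}{\Psi_\ell(\varepsilon)^{1+\rho}} \right) + c \left( \frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right) \mathrm{Log}\left( \frac{1}{\delta} \right) \tag{12.43}$$

and

$$n \ge c \, \theta a \varepsilon^{\alpha} \frac{q \|\mathrm{F}\|_{\mathcal{P}_{XY}}^{2\rho}}{(1 - \rho)^2} \left( \frac{b^{1-\rho} B_3}{\Psi_\ell(\varepsilon)^{2 - \beta(1-\rho)}} + \frac{\bar{\ell}^{1-\rho} C_3}{\Psi_\ell(\varepsilon)^{1+\rho}} \right) + c \, \theta a \varepsilon^{\alpha} \left( \frac{b B_1 \mathrm{Log}(B_1/\delta)}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell} C_1 \mathrm{Log}(C_1/\delta)}{\Psi_\ell(\varepsilon)} \right) \tag{12.44}$$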
then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function, Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1 - \delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(h^*) \le \varepsilon$.

The sufficient size of $u$ in Theorem 12.20 is essentially identical (up to the constant factors) to the number of labels sufficient for $\mathrm{ERM}_\ell$ to achieve the same, as indicated by Theorem 12.19. In particular, the dependence on $\varepsilon$ in these results is $O\left( \Psi_\ell(\varepsilon)^{\beta(1-\rho) - 2} \right)$. On the other hand, when $\theta(\varepsilon^{\alpha}) = o(\varepsilon^{-\alpha})$, the sufficient size of $n$ in Theorem 12.20 does reflect an improvement in the number of labels indicated by Theorem 12.19, by a factor with dependence on $\varepsilon$ of $O(\theta \varepsilon^{\alpha})$.

As before, in the special case when $\ell$ satisfies Condition 12.3, we can derive sometimes stronger results via Corollary 12.9. In this case, we will distinguish between the cases of (12.40) and (12.39), as we find a slightly stronger result for the former.

First, suppose (12.40) is satisfied for all finitely discrete $P$ and all $\varepsilon > 0$, with $\mathrm{F} \le \bar{\ell}$. Then following the derivation of (12.42) above, combined with (12.9), (12.8), and Lemma 12.12, for values of $j \ge j_\ell$ in Corollary 12.9,

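$$\mathring{M}_\ell\left( \frac{2^{-j-7}}{\mathcal{P}(U_j)}, \frac{2^{2-j}}{\mathcal{P}(U_j)}; \mathcal{F}_j, \mathcal{P}_{U_j}, s \right) \lesssim \frac{q \bar{\ell}^{2\rho}}{(1 - \rho)^2} \left( b^{1-\rho} \left( 2^j \mathcal{P}(U_j) \right)^{2 - \beta(1-\rho)} + \bar{\ell}^{1-\rho} \left( 2^j \mathcal{P}(U_j) \right)^{1+\rho} \right) + \left( b \left( 2^j \mathcal{P}(U_j) \right)^{2-\beta} + \bar{\ell} \, 2^j \mathcal{P}(U_j) \right) s,$$

where $b$ and $\beta$ are from Lemma 12.12. This immediately leads to the following result by reasoning analogous to the proof of Theorem 12.17.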

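Theorem 12.21. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10, $\ell$ is classification-calibrated and satisfies Condition 12.3, $h^* \in \mathcal{F}$, $\Psi_\ell$ is as in (12.15), $b$ and $\beta$ are as in Lemma 12.12, and (12.40) is satisfied for all finitely discrete $P$ and all $\varepsilon > 0$, with $\mathrm{F} \le \bar{\ell}$, then for any $\varepsilon \in (0, 1)$, letting $B_2$ and $C_2$ be as in Theorem 12.17, $B_4 = \min\left\{ \frac{1}{1 - 2^{(\alpha - 1)(2 - \beta(1-\rho))}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, $C_4 = \min\left\{ \frac{1}{1 - 2^{(\alpha - 1)(1+\rho)}}, \mathrm{Log}(\bar{\ell} / \Psi_\ell(\varepsilon)) \right\}$, and $\theta = \theta(a \varepsilon^{\alpha})$, if

$$u \ge c \, \frac{q \bar{\ell}^{2\rho}}{(1 - \rho)^2} \left[ \frac{b^{1-\rho}}{\Psi_\ell(\varepsilon)} \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{1 - \beta(1-\rho)} + \frac{\bar{\ell}^{1-\rho}}{\Psi_\ell(\varepsilon)} \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{\rho} \right] + c \left[ \frac{b}{\Psi_\ell(\varepsilon)} \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{1-\beta} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right] \mathrm{Log}(1/\delta)$$

and

$$n \ge c \, \frac{q \bar{\ell}^{2\rho}}{(1 - \rho)^2} \left[ b^{1-\rho} B_4 \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{2 - \beta(1-\rho)} + C_4 \bar{\ell}^{1-\rho} \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{1+\rho} \right] + c \left[ B_2 \mathrm{Log}(B_2/\delta) \, b \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right)^{2-\beta} + C_2 \mathrm{Log}(C_2/\delta) \, \bar{\ell} \left( \frac{a \theta \varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)} \right) \right]$$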


then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function, Algorithm 1 uses at most $u$
unlabeled samples and makes at most $n$ label requests, and with probability at least $1-\delta$,
returns a function $\hat{h}$ with $\operatorname{er}(\hat{h}) - \operatorname{er}(h^*) \le \varepsilon$.

Compared to Theorem 12.20, in terms of the asymptotic dependence on $\varepsilon$, the sufficient
sizes  for  both  u  and  n  here  may  be  smaller  by  a  factor  of  O  (^(0en)'  l  ,  which  sometimes 
represents a significant refinement, particularly when $\theta$ is much smaller than $\varepsilon^{-\alpha}$. In particular,
as was the case in Theorem 12.17, when $\theta(\varepsilon) = o(1/\varepsilon)$, the size of $u$ indicated by Theorem 12.21
is smaller than the known results for $\mathrm{ERM}_{\ell}(\mathcal{F}, Z_m)$ from Theorem 12.19.

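The case where (12.39) is satisfied can be treated similarly, though the result we obtain here
is slightly weaker. Specifically, for simplicity suppose (12.39) is satisfied with $F = \bar{\ell}$ constant. In
this case, we have $\bar{\ell} \ge F(\mathcal{G}_{\mathcal{F}_j, \mathcal{P}_{\mathcal{U}_j}})$ as well, while
$N_{[]}(\varepsilon\bar{\ell}, \mathcal{G}_{\mathcal{F}_j}, L_2(\mathcal{P}_{\mathcal{U}_j})) = N_{[]}(\varepsilon\bar{\ell}\sqrt{\mathcal{P}(\mathcal{U}_j)}, \mathcal{G}_{\mathcal{F}_j}, L_2(\mathcal{P}_{XY}))$,
which is no larger than $N_{[]}(\varepsilon\bar{\ell}\sqrt{\mathcal{P}(\mathcal{U}_j)}, \mathcal{G}_{\mathcal{F}}, L_2(\mathcal{P}_{XY}))$, so that $\mathcal{F}_j$ and $\mathcal{P}_{\mathcal{U}_j}$ also satisfy (12.39)
with $F = \bar{\ell}$; specifically,

\[
\ln N_{[]}\left(\varepsilon\bar{\ell}, \mathcal{G}_{\mathcal{F}_j}, L_2(\mathcal{P}_{\mathcal{U}_j})\right) \le q\,\mathcal{P}(\mathcal{U}_j)^{-\rho}\varepsilon^{-2\rho}.
\]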

Thus, based on (12.42), (12.8), (12.9), and Lemma 12.12, we have that if $h^* \in \mathcal{F}$ and Condition 12.3
is satisfied, then for $j \ge j_{\ell}$ in Corollary 12.9,

o  /  2~j~7  22~j 

Mi  yviUj)'  v(uj),:Fj,Vuj,s 

<  pW)"'  (b1-' (VVM))2-"1-*  +  P-"  (2 \iV(Uj))1+p) 

+  (l,(VV(Uj)f-f‘  +  12^^))  s, 

where $b$ and $\beta$ are as in Lemma 12.12. Combining this with Corollary 12.9 and reasoning analogously
to the proof of Theorem 12.17, we have the following result.

Theorem 12.22. For a universal constant $c \in [1,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 12.10, $\ell$ is
classification-calibrated and satisfies Condition 12.3, $h^* \in \mathcal{F}$, $\Psi_{\ell}$ is as in (12.15), $b$ and $\beta$
are as in Lemma 12.12, and (12.39) is satisfied with $F = \bar{\ell}$ constant, then for any $\varepsilon \in (0,1)$,
letting  B-2  and  C2  be  as  in  Theorem  12.17,  B5  =  min  j  1_2(t,-1)(2l^c1-p))-n,p ,  Log  }>  C5  = 
min


Log  (  J  j,  andd  =  6  (aea),  if 


u  >  c 


q£2p 


bl~P 


(i-p)2/  \\^^)1+PJ  V^(£) 


a9ea  \  {1_/3)(1-p) 


4T(a)1+p 
a,6Ca  x  1-/3 


*<(£)/  V^(£) 


+ 


*<(e) 


Log(l/ S) 


and 


(  q?2p  \  f  f  £  adea  \1+{1~m~p)  Cd1~padea\ 

U  ~  °  U1  -P)2/  V^(£)/  +  ^(£)1+p  J 

(f  nOea  \  f  nf)pa  \  \ 

bBilozWS)  (m)  +RUMC./S)  (^))J  . 

then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function, Algorithm 1 uses at most $u$
unlabeled samples and makes at most $n$ label requests, and with probability at least $1-\delta$,
returns a function $\hat{h}$ with $\operatorname{er}(\hat{h}) - \operatorname{er}(h^*) \le \varepsilon$.

In this case, compared to Theorem 12.20, in terms of the asymptotic dependence on $\varepsilon$, the
sufficient  sizes  for  both  u  and  n  here  may  be  smaller  by  a  factor  of  0  ^(0ea)('  ?  which 

may sometimes be significant, though not quite as dramatic a refinement as we found under
(12.40) in Theorem 12.21. As with Theorem 12.21, when $\theta(\varepsilon) = o(1/\varepsilon)$, the size of $u$ indicated
by Theorem 12.22 is smaller than the known results for $\mathrm{ERM}_{\ell}(\mathcal{F}, Z_m)$ from Theorem 12.19.


12.5.6  Remarks  on  VC  Major  and  VC  Hull  Classes 

Another widely-studied family of function classes includes VC Major classes. Specifically, we
say $\mathcal{G}$ is a VC Major class with index $d$ if $d = \mathrm{vc}(\{\{z : g(z) \ge t\} : g \in \mathcal{G}, t \in \mathbb{R}\}) < \infty$.
We can derive results for VC Major classes, analogously to the above, as follows. For brevity,
we leave many of the details as an exercise for the reader. For any VC Major class $\mathcal{G} \subseteq \mathcal{G}^*$
with index $d$, by reasoning similar to that of Giné and Koltchinskii [2006], one can show that if
$F = \bar{\ell}\,\mathbb{1}_{\mathcal{U}} \ge F(\mathcal{G})$ for some measurable $\mathcal{U} \subseteq \mathcal{X} \times \mathcal{Y}$, then for any distribution $P$ and $\varepsilon > 0$,

In  A/"  (e\\F\\p,Q ,  L2(P))  <  ^  log  Q  log  Q  . 

This implies that for $\mathcal{F}$ a VC Major class, and $\ell$ classification-calibrated and either nonincreasing
or Lipschitz, if $h^* \in \mathcal{F}$ and $\mathcal{P}_{XY}$ satisfies Condition 12.10 and Condition 12.11, then the conditions
of Theorem 12.7 can be satisfied with the probability bound being at least $1-\delta$, for some
u  =  0  +  'W'3-2)  and  n  =  6  ( ,  where  6  =  9(aea),  and 

$\tilde{O}(\cdot)$ hides logarithmic and constant factors. Under Condition 12.3, with $\beta$ as in Lemma 12.12,
the conditions of Corollary 12.9 can be satisfied with the probability bound being at least $1-\delta$,
for  some  u  —  6  ((*^5)  ^  andn  =  6  ((|g;)2 

For example, for $\mathcal{X} = [0,1]$ and $\mathcal{F}$ the class of all nondecreasing functions mapping $\mathcal{X}$ to
$[-1,1]$, $\mathcal{F}$ is a VC Major class with index 1, and $\theta(0) \le 2$ for all distributions $\mathcal{P}$. Thus, for
instance, if $\eta$ is nondecreasing and $\ell$ is the quadratic loss, then $h^* \in \mathcal{F}$, and Algorithm 1 achieves
excess error rate $\varepsilon$ with high probability for some $u = \tilde{O}(\varepsilon^{2\alpha-3})$ and $n = \tilde{O}(\varepsilon^{3(\alpha-1)})$.
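
To make the "index 1" claim concrete: the sets $\{z : g(z) \ge t\}$ of a nondecreasing $g$ are upper intervals, so of two points $x_1 < x_2$ the set can never contain $x_1$ without $x_2$, and no pair of points is shattered. The following is a minimal sketch that checks this on randomly generated nondecreasing functions; the grid, the sampling scheme, and the helper names are illustrative assumptions, not part of the original analysis.

```python
import random


def random_nondecreasing(grid_size, rng):
    """Hypothetical helper: a nondecreasing sequence of values in [-1, 1] on a grid."""
    return sorted(rng.uniform(-1.0, 1.0) for _ in range(grid_size))


def superlevel_patterns(num_functions=200, grid_size=50, seed=0):
    """Collect membership patterns of two grid points x1 < x2 in sets {z : g(z) >= t}."""
    rng = random.Random(seed)
    i1, i2 = grid_size // 3, 2 * grid_size // 3  # grid indices of x1 < x2
    patterns = set()
    for _ in range(num_functions):
        g = random_nondecreasing(grid_size, rng)
        for _ in range(20):
            t = rng.uniform(-1.0, 1.0)
            patterns.add((g[i1] >= t, g[i2] >= t))
    return patterns


patterns = superlevel_patterns()
# The pattern (True, False), i.e. x1 in the set but x2 not, should never appear.
print(sorted(patterns))
```

Since one of the four possible labelings of any pair is always missing, the collection of superlevel sets has VC dimension 1, in line with the index stated in the text.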


VC Major classes are contained in special types of VC Hull classes, which are more generally
defined as follows. Let $\mathcal{C}$ be a VC Subgraph class of functions on $\mathcal{X}$, with bounded envelope, and
for $B \in (0,\infty)$, let
\[
\mathcal{F} = B\,\mathrm{conv}(\mathcal{C}) = \left\{ x \mapsto B\sum_{j} \lambda_j h_j(x) : \sum_{j} |\lambda_j| \le 1,\ h_j \in \mathcal{C} \right\}
\]
denote the scaled symmetric convex hull of $\mathcal{C}$; then $\mathcal{F}$ is called a VC Hull class. For instance, these spaces
are often used in conjunction with the popular AdaBoost learning algorithm. One can derive
results for VC Hull classes following analogously to the above, using established bounds on the
uniform covering numbers of VC Hull classes [see van der Vaart and Wellner, 1996, Corollary
2.6.12], and noting that for any VC Hull class $\mathcal{F}$ with envelope function $F$, and any $\mathcal{U} \subseteq \mathcal{X}$, $\mathcal{F}_{\mathcal{U}}$
is also a VC Hull class, with envelope function $F\mathbb{1}_{\mathcal{U}}$. Specifically, one can use these observations
to derive the following results. For a VC Hull class $\mathcal{F} = B\,\mathrm{conv}(\mathcal{C})$ with $d = 2\mathrm{vc}(\mathcal{C})$, if $\ell$ is
classification-calibrated and Lipschitz, $h^* \in \mathcal{F}$, and $\mathcal{P}_{XY}$ satisfies Condition 12.10 and Condition 12.11,
then the conditions of Theorem 12.7 can be satisfied with the probability bound being


2§_ 


2d+2 


M, 


at  least  1  —  5,  for  some  u  =  O  [(Oe01)  d+2  \E^(e)  d+2  j  and  n  —  O  (J0£a)  d+2  ^e(e)  d+2  J .  Un¬ 
der  Condition  12.3,  with  /3  as  in  Lemma  12.12,  the  conditions  of  Corollary  12.9  can  be  satisfied 


with  the  probability  bound  being  at  least  1  —  5,  for  some  u  =  O 


i 


9ea 

mt(e)  )  1  ^lle) 


i  2  g 
d+ 2 


and 
N  2--&- 

9ea  \  d+ 2 


*e(e) 

However, it is not clear whether these results for VC Hull classes have
any practical implications, since we do not know of any examples of VC Hull classes where these
results reflect an improvement over a more direct analysis of $\mathrm{ERM}_{\ell}$ for these scenarios.
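
As a small illustration of the definition of a VC Hull class, the sketch below builds one element of $B\,\mathrm{conv}(\mathcal{C})$ from threshold functions ("decision stumps"), the kind of weighted vote that AdaBoost produces. The choice of stumps as the base class and the names `make_stump` and `hull_element` are assumptions made only for this example.

```python
from typing import Callable, Sequence

Stump = Callable[[float], float]


def make_stump(threshold: float) -> Stump:
    """Base function h(x) = +1 if x >= threshold, else -1 (assumed base class)."""
    return lambda x: 1.0 if x >= threshold else -1.0


def hull_element(B: float, lambdas: Sequence[float], stumps: Sequence[Stump]) -> Stump:
    """Return x -> B * sum_j lambda_j h_j(x), requiring sum_j |lambda_j| <= 1."""
    if sum(abs(lam) for lam in lambdas) > 1.0 + 1e-12:
        raise ValueError("weights must satisfy sum |lambda_j| <= 1")
    return lambda x: B * sum(lam * h(x) for lam, h in zip(lambdas, stumps))


stumps = [make_stump(t) for t in (0.2, 0.5, 0.8)]
f = hull_element(2.0, [0.5, -0.25, 0.25], stumps)
values = [f(x) for x in (0.0, 0.3, 0.6, 1.0)]
print(values)
```

Because the weights have total absolute mass at most 1 and each stump is bounded by 1 in absolute value, every such $f$ is bounded by $B$, which is the constant envelope used for these classes.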


12.6  Proofs 

Proof of Theorem 12.7. Fix any $\varepsilon \in (0,1)$, $s \in [1,\infty)$, values $u_j$ satisfying (12.12), and consider
running Algorithm 1 with values of $u$ and $n$ satisfying the conditions specified in Theorem 12.7.
The proof has two main components: first, showing that, with high probability, $h^* \in V$ is maintained
as an invariant, and second, showing that, with high probability, the set $V$ will be sufficiently
reduced to provide the guarantee on $\hat{h}$ after at most the stated number of label requests,
given the value of $u$ is as large as stated. Both of these components are served by the following
application of Lemma 12.4.

Let $S$ denote the set of values of $m$ obtained in Algorithm 1 for which $\log_2(m) \in \mathbb{N}$. For
each $m \in S$, let $V^{(m)}$ and $Q_m$ denote the values of $V$ and $Q$ (respectively) upon reaching
Step 5 on the round that Algorithm 1 obtains that value of $m$, and let $\tilde{V}^{(m)}$ denote the value
of $V$ upon completing Step 6 on that round; also denote $D_m = \mathrm{DIS}(V^{(m)})$ and
$\mathcal{L}_m = \{(X_{1+m/2}, Y_{1+m/2}), \ldots, (X_m, Y_m)\}$, and define $\tilde{V}^{(1)} = \mathcal{F}$ and $D_1 = \mathrm{DIS}(\mathcal{F})$.
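
The index bookkeeping behind $S$ and $\mathcal{L}_m$ can be sketched as follows. The sketch assumes that $\mathbb{N}$ starts at 1, so that the dyadic values are $m = 2, 4, 8, \ldots$; the function names are hypothetical. It shows that the blocks $\mathcal{L}_m$ for different dyadic $m$ use disjoint sets of sample indices, which is consistent with applying Lemma 12.4 conditionally on $V^{(m)}$.

```python
def dyadic_values(u: int) -> list:
    """All m <= u with log2(m) a positive integer: 2, 4, 8, ..."""
    values, m = [], 2
    while m <= u:
        values.append(m)
        m *= 2
    return values


def label_block(m: int) -> list:
    """Sample indices 1 + m/2, ..., m making up the block L_m."""
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of two, m >= 2")
    return list(range(1 + m // 2, m + 1))


blocks = {m: label_block(m) for m in dyadic_values(16)}
all_indices = [i for m in sorted(blocks) for i in blocks[m]]
print(blocks)
```

Each block has size $m/2$, matching the factor $m/2$ that appears when empirical risks on $Q_m$ are rewritten in terms of $\mathcal{L}_m$.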

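Consider any $m \in S$, and note that $\forall h, g \in V^{(m)}$,
\[
(|Q_m| \vee 1)\left(R_{\ell}(h; Q_m) - R_{\ell}(g; Q_m)\right)
= \frac{m}{2}\left(R_{\ell}(h_{D_m}; \mathcal{L}_m) - R_{\ell}(g_{D_m}; \mathcal{L}_m)\right), \tag{12.45}
\]
and furthermore that
\[
(|Q_m| \vee 1)\,\hat{U}_{\ell}\left(V^{(m)}; Q_m, \hat{s}(m)\right) = \frac{m}{2}\,\hat{U}_{\ell}\left(V^{(m)}_{D_m}; \mathcal{L}_m, \hat{s}(m)\right). \tag{12.46}
\]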

Applying Lemma 12.4 under the conditional distribution given $V^{(m)}$, combined with the law of
total probability, we have that, for every $m \in \mathbb{N}$ with $\log_2(m) \in \mathbb{N}$, on an event of probability
at least $1 - 6e^{-\hat{s}(m)}$, if $h^* \in V^{(m)}$ and $m \in S$, then letting $\hat{U}_m = \hat{U}_{\ell}\left(V^{(m)}_{D_m}; \mathcal{L}_m, \hat{s}(m)\right)$, every
$h_{D_m} \in V^{(m)}_{D_m}$ has

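\[
R_{\ell}(h_{D_m}) - R_{\ell}(h^*) \le R_{\ell}(h_{D_m}; \mathcal{L}_m) - R_{\ell}(h^*; \mathcal{L}_m) + \hat{U}_m, \tag{12.47}
\]
\[
R_{\ell}(h_{D_m}; \mathcal{L}_m) - \min_{g_{D_m} \in V^{(m)}_{D_m}} R_{\ell}(g_{D_m}; \mathcal{L}_m) \le R_{\ell}(h_{D_m}) - R_{\ell}(h^*) + \hat{U}_m, \tag{12.48}
\]
and furthermore
\[
\hat{U}_m \le \tilde{U}_{\ell}\left(V^{(m)}_{D_m}; \mathcal{P}_{XY}, m/2, \hat{s}(m)\right). \tag{12.49}
\]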

By a union bound, on an event of probability at least $1 - \sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6e^{-\hat{s}(2^i)}$, for every $m \in S$
with $m \le u_{j_\varepsilon}$ and $h^* \in V^{(m)}$, the inequalities (12.47), (12.48), and (12.49) hold. Call this event
$E$.

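In particular, note that on the event $E$, for any $m \in S$ with $m \le u_{j_\varepsilon}$ and $h^* \in V^{(m)}$, since
$h^*_{D_m} = h^*$, (12.45), (12.48), and (12.46) imply
\[
(|Q_m| \vee 1)\left(R_{\ell}(h^*; Q_m) - \inf_{g \in V^{(m)}} R_{\ell}(g; Q_m)\right)
= \frac{m}{2}\left(R_{\ell}(h^*; \mathcal{L}_m) - \inf_{g_{D_m} \in V^{(m)}_{D_m}} R_{\ell}(g_{D_m}; \mathcal{L}_m)\right)
\le \frac{m}{2}\hat{U}_m = (|Q_m| \vee 1)\,\hat{U}_{\ell}\left(V^{(m)}; Q_m, \hat{s}(m)\right),
\]
so that $h^* \in \tilde{V}^{(m)}$ as well. Since $h^* \in \tilde{V}^{(1)}$, and every $m \in S$ with $m \ge 2$ has $V^{(m)} = \tilde{V}^{(m/2)}$,
by induction we have that, on the event $E$, every $m \in S$ with $m \le u_{j_\varepsilon}$ has $h^* \in V^{(m)}$ and
$h^* \in \tilde{V}^{(m)}$; this also implies that (12.47), (12.48), and (12.49) all hold for these values of $m$ on
the event $E$.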

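We next prove by induction that, on the event $E$, $\forall j \in \{j_{\ell}-2, j_{\ell}-1, j_{\ell}, \ldots, j_{\varepsilon}\}$, if $u_j \in S \cup \{1\}$,
then $\tilde{V}^{(u_j)}_{D_{u_j}} \subseteq [\mathcal{F}](2^{-j}; \ell)$ and $\tilde{V}^{(u_j)} \subseteq \mathcal{F}(\mathcal{E}_{\ell}(2^{-j}); 01)$. This claim is trivially satisfied
for $j \in \{j_{\ell}-2, j_{\ell}-1\}$, since in that case $[\mathcal{F}](2^{-j}; \ell) = [\mathcal{F}] \supseteq \tilde{V}^{(u_j)}_{D_{u_j}}$ and $\mathcal{F}(\mathcal{E}_{\ell}(2^{-j}); 01) = \mathcal{F}$,
so that these values can serve as our base case. Now take as an inductive hypothesis that, for
some $j \in \{j_{\ell}, \ldots, j_{\varepsilon}\}$, if $u_{j-2} \in S \cup \{1\}$, then on the event $E$, $\tilde{V}^{(u_{j-2})}_{D_{u_{j-2}}} \subseteq [\mathcal{F}](2^{2-j}; \ell)$ and
$\tilde{V}^{(u_{j-2})} \subseteq \mathcal{F}(\mathcal{E}_{\ell}(2^{2-j}); 01)$, and suppose the event $E$ occurs. If $u_j \notin S$, the claim is trivially
satisfied; otherwise, suppose $u_j \in S$, which further implies $u_{j-2} \in S \cup \{1\}$. Since $u_j \le u_{j_\varepsilon}$, for
any $h \in \tilde{V}^{(u_j)}$, (12.47) implies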

y  (R((V,)  -  <  y  (R,(fty  +  ft,,)  . 

Since  we  have  already  established  that  h*  G  V(,lj\  (12.45)  and  (12.46)  imply 


y  (RdS^,)  -  R((/i*; £„,)  +  A) 

=  (|Q%|  V  1)  (r,(/»;Q%.)  -  R, (A*;  Q„,.)  +  «(%)))  . 

The  definition  of  V(Uj>  from  Step  6  implies 

(\QUj\  V  1)  (R e(h;  QUj)  -  Re(h*\ QUj)  +  Ue(V^-,  QUj,s(uj))) 

<  (\QUj\  V  1)  (2 Ue(VM\QUi,s(^)))  ■ 


By  (12.46)  and  (12.49), 


(\Quj\  V  1)  (2Ue(V^-,Qups(u,)))  =  ujUUj<ujUt{y^;PXY,uj/2,s(uj 

Altogether,  we  have  that,  \/h  G  V<-U'l\ 


MhDuj )  -  R t{h*)  <  2U£  (v^:VXY,u]/2,s(uj))  .  (12.50) 

By  definition  of  IvR,  monotonicity  ofmG  Ue(',  •>  •,  m,  ■),  and  the  condition  on  u3  in  (12.12),  we 
know  that 

Ui  (Tv22-jiVXYlu3/21s(uj))  < 

The  fact  that  u3  >  2mj_2,  combined  with  the  inductive  hypothesis,  implies 


C  R(^-2)  C  T  (££(22-j');oi)  . 


This  also  implies  DUj  C  DIS(Jr(£^(22  J);  01)).  Combined  with  (12.7),  these  imply 


U, 


T/(uj) 

VDuj  ’ 


)2-i. 


;  'Pyy,  Wj/2,  s(uj 


<  2_J_1. 
Together with (12.6), this implies

Ue  (v^(22-j-fy,VXY,uj/2,s(uj))  < 

The  inductive  hypothesis  implies  £) ,  which  means 

Plugging  this  into  (12.50)  implies,  V/i  G 

R^Du.)-R£(r)<2-T  (12.51) 

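In particular, since $h^* \in \mathcal{F}$, we always have $\tilde{V}^{(u_j)}_{D_{u_j}} \subseteq [\mathcal{F}]$, so that (12.51) establishes that
$\tilde{V}^{(u_j)}_{D_{u_j}} \subseteq [\mathcal{F}](2^{-j}; \ell)$. Furthermore, since $h^* \in \tilde{V}^{(u_j)}$ on $E$, $\operatorname{sign}(h_{D_{u_j}}) = \operatorname{sign}(h)$ for every $h \in \tilde{V}^{(u_j)}$, so
that every $h \in \tilde{V}^{(u_j)}$ has $\operatorname{er}(h) = \operatorname{er}(h_{D_{u_j}})$, and therefore (by definition of $\mathcal{E}_{\ell}(\cdot)$), (12.51) implies
\[
\operatorname{er}(h) - \operatorname{er}(h^*) = \operatorname{er}(h_{D_{u_j}}) - \operatorname{er}(h^*) \le \mathcal{E}_{\ell}(2^{-j}).
\]
This implies $\tilde{V}^{(u_j)} \subseteq \mathcal{F}(\mathcal{E}_{\ell}(2^{-j}); 01)$, which completes the inductive proof. This implies that, on
the event $E$, if $u_{j_\varepsilon} \in S$, then (by monotonicity of $\mathcal{E}_{\ell}(\cdot)$ and the fact that $\mathcal{E}_{\ell}(\Psi_{\ell}(\varepsilon)) \le \varepsilon$)
\[
\tilde{V}^{(u_{j_\varepsilon})} \subseteq \mathcal{F}(\mathcal{E}_{\ell}(2^{-j_\varepsilon}); 01) \subseteq \mathcal{F}(\mathcal{E}_{\ell}(\Psi_{\ell}(\varepsilon)); 01) \subseteq \mathcal{F}(\varepsilon; 01).
\]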

In particular, since the update in Step 6 always keeps at least one element in $V$, the function
$\hat{h}$ in Step 8 exists, and has $\hat{h} \in \tilde{V}^{(u_{j_\varepsilon})}$ (if $u_{j_\varepsilon} \in S$). Thus, on the event $E$, if $u_{j_\varepsilon} \in S$, then
$\operatorname{er}(\hat{h}) - \operatorname{er}(h^*) \le \varepsilon$. Therefore, since $u \ge u_{j_\varepsilon}$, to complete the proof it suffices to show that
taking $n$ of the size indicated in the theorem statement suffices to guarantee $u_{j_\varepsilon} \in S$, on an event
(which includes $E$) having at least the stated probability.

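Note that for any $j \in \{j_{\ell}, \ldots, j_{\varepsilon}\}$ with $u_{j-1} \in S \cup \{1\}$, every $m \in \{u_{j-1}+1, \ldots, u_j\} \cap S$
has $V^{(m)} \subseteq \tilde{V}^{(u_{j-1})}$; furthermore, we showed above that on the event $E$, if $u_{j-1} \in S$, then
$\tilde{V}^{(u_{j-1})} \subseteq \mathcal{F}(\mathcal{E}_{\ell}(2^{1-j}); 01)$, so that $\mathrm{DIS}(V^{(m)}) \subseteq \mathrm{DIS}(\tilde{V}^{(u_{j-1})}) \subseteq \mathrm{DIS}(\mathcal{F}(\mathcal{E}_{\ell}(2^{1-j}); 01)) \subseteq \mathcal{U}_j$.
Thus, on the event $E$, to guarantee $u_{j_\varepsilon} \in S$, it suffices to have
\[
n \ge \sum_{j=j_{\ell}}^{j_{\varepsilon}} \sum_{m=u_{j-1}+1}^{u_j} \mathbb{1}_{\mathcal{U}_j}(X_m).
\]
Noting that this is a sum of independent Bernoulli random variables, a Chernoff bound implies
that on an event $E'$ of probability at least $1 - 2^{-s}$,
\[
\sum_{j=j_{\ell}}^{j_{\varepsilon}} \sum_{m=u_{j-1}+1}^{u_j} \mathbb{1}_{\mathcal{U}_j}(X_m)
\le s + 2e \sum_{j=j_{\ell}}^{j_{\varepsilon}} \sum_{m=u_{j-1}+1}^{u_j} \mathcal{P}(\mathcal{U}_j)
= s + 2e \sum_{j=j_{\ell}}^{j_{\varepsilon}} \mathcal{P}(\mathcal{U}_j)(u_j - u_{j-1})
\le s + 2e \sum_{j=j_{\ell}}^{j_{\varepsilon}} \mathcal{P}(\mathcal{U}_j)u_j.
\]
Thus, for $n$ satisfying the condition in the theorem statement, on the event $E \cap E'$, we have
$u_{j_\varepsilon} \in S$, and therefore (as proven above) $\operatorname{er}(\hat{h}) - \operatorname{er}(h^*) \le \varepsilon$. Finally, a union bound implies that
the event $E \cap E'$ has probability at least
\[
1 - 2^{-s} - \sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6e^{-\hat{s}(2^i)},
\]
as required. □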
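
The Chernoff step above uses a bound of the form "a sum of independent Bernoulli variables with mean $\mu$ exceeds $s + 2e\mu$ with probability at most $2^{-s}$". The following sketch checks this numerically in the simplified case of identically distributed Bernoulli variables (a binomial), using the exact binomial tail; the restriction to the i.i.d. case and the function names are assumptions of this example, whereas the proof applies the bound to non-identical indicators.

```python
from math import comb, e


def binomial_upper_tail(n: int, p: float, t: float) -> float:
    """Exact P(X > t) for X ~ Binomial(n, p), with t >= 0."""
    first = int(t) + 1  # smallest integer strictly greater than t
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(first, n + 1))


def chernoff_threshold(n: int, p: float, s: float) -> float:
    """Threshold s + 2e * mean, with mean = n * p."""
    return s + 2.0 * e * n * p


cases = [(50, 0.1, 1.0), (200, 0.05, 3.0), (100, 0.01, 2.0), (30, 0.5, 2.0)]
results = [
    (binomial_upper_tail(n, p, chernoff_threshold(n, p, s)), 2.0 ** (-s))
    for n, p, s in cases
]
for tail, bound in results:
    print(f"tail = {tail:.3e}  <=  2^-s = {bound:.3e}")
```

In these cases the exact tail is far below $2^{-s}$, which reflects that the constant $2e$ is chosen for convenience rather than tightness.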

Proof  of  Lemma  12.8.  If  P  (DISF("H))  =  0,  then  =  0,  so  that  in  this  case, 

trivially  satisfies  (12.5).  Otherwise,  suppose  P  (DISF('P))  >  0.  By  the  classic  symmetrization 
inequality  [e.g.,  van  der  Vaart  and  Wellner,  1996,  Lemma  2.3.1], 


4>e(l-L,  m,  P )  <  2E 


Qt  “[m]) 


5 


where  Q  ~  P'n  and  S[m]  =  {E . . . .  ,  ~  Uniform({  — 1,  +l}m)  are  independent.  Fix  any 

measurable  U  D  DISF(H).  Then 


E 


flijH]  Q,  ^[m]) 


=  E 


MU-,Q  nw,S[|QnW|]) 


\Qnu\ 


m 


(12.52) 


where  E[q]  =  £q}  for  any  q  £  {0, . . . ,  m}.  By  the  classic  desymmetrization  inequality 

[see  e.g.,  Koltchinskii,  2008],  applied  under  the  conditional  distribution  given  \Q  D  li\,  the  right 
hand  side  of  (12.52)  is  at  most 


E 


2(j)e{n,\QnU\,Pu\ 


\Qnu\ 


m 


E 


+  sup  |R e(h;  Pu)  -  R e(g;  Pu)\ 
h.gen 


sjWYU\ 


m 


(12.53) 
By Jensen’s inequality, the second term in (12.53) is at most


sup  |R t(h\  Pu) 

h,g£H 


Pu)  | 


<D  t{U\Pu) 


D  e{W,P) 


Decomposing based on $|Q \cap \mathcal{U}|$, the first term in (12.53) is at most


E 


2(fie{n,\Qnu\,Pu)^I^-i[\Qnu\  >  (1/2 )P{U)m] 


m 


+  2 £P(U)F  (|Q  n  U\  <  (l/2)P(U)m) .  (12.54) 


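Since $|Q \cap \mathcal{U}| \ge (1/2)P(\mathcal{U})m \Rightarrow |Q \cap \mathcal{U}| \ge \lceil (1/2)P(\mathcal{U})m \rceil$, and $\phi_{\ell}(\mathcal{H}, q, P_{\mathcal{U}})$ is nonincreasing
in $q$, the first term in (12.54) is at most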


2 4>t{U,  \(l/2)P(U)m],Pu)E 


\Qnu 

m 


2<j>e{n,\{l/2)P{U)mlPu)P{U), 


while a Chernoff bound implies the second term in (12.54) is at most
\[
2\bar{\ell}P(\mathcal{U})\exp\{-P(\mathcal{U})m/8\} \le \frac{16\bar{\ell}}{m}.
\]
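
The last inequality can be verified in one line, a step the text leaves implicit. Writing $x = P(\mathcal{U})m/8$ (a substitution introduced here only for this check) and using the elementary fact that $xe^{-x} \le 1/e$ for all $x \ge 0$:

```latex
\[
2\bar{\ell}\,P(\mathcal{U})\,e^{-P(\mathcal{U})m/8}
= \frac{16\bar{\ell}}{m}\cdot x e^{-x}
\le \frac{16\bar{\ell}}{e\,m}
\le \frac{16\bar{\ell}}{m},
\qquad x = \frac{P(\mathcal{U})\,m}{8}.
\]
```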


Plugging  back  into  (12.53),  we  have 

32  f  rr 

<i>e(H,m,P)  <  AMP,  \(l/2)P(U)m],  Pu)P(U)  +  —  +  2D e(H;P)J-.  (12.55) 

m  V  m 

Next,  note  that,  for  any  o  >  D P),  ^2^  >  D Pu)-  Also,  if  U  —  U'  x  y  for  some 
U'  D  DISF("H),  then  h*pu  =  h*p,  so  that  if  h*p  G  H,  (12.5)  implies 


MU,  f(l/2 )P{U)m\Pu)  < 


a 


--,H;  f(l/2 )P(U)m],Pu 


(12.56) 


Combining  (12.55)  with  (12.56),  we  see  that  (p[  satisfies  the  condition  (12.5)  of  Definition  12.5. 

Furthermore,  by  the  fact  that  (fie  satisfies  (12.4)  of  Definition  12.5,  combined  with  the  mono¬ 
tonicity  imposed  by  the  infimum  in  the  definition  of  (fi'e,  it  is  easy  to  check  that  <fi'e  also  satisfies 
(12.4)  of  Definition  12.5.  In  particular,  note  that  any  H"  C  'H'  C  'P\  and  U"  C  X  have 
DISF(4f^,/)  C  DISFCH'),  so  that  the  range  of  U  in  the  infimum  is  never  smaller  for  %  = 
relative  to  that  for  'H  —  'H'.  □ 

Proof of Corollary 12.9. Let $\phi'_{\ell}$ be as in Lemma 12.8, and define for any $m \in \mathbb{N}$, $s \in [1,\infty)$,
$\zeta \in [0,\infty]$, and $\mathcal{H} \subseteq [\mathcal{F}]$,

U'e('H,C,'PxY,m,s) 

=  A'  L;(D,(|«](C;Q)-«;m,'Pxy)  +  D(([«](CA))i/^  +  -)  ■ 

V  V  Ul  Ul  J 

That  is,  Uf  is  the  function  IJ(  that  would  result  from  using  fae  in  place  of  fa.  Let  U  =  DISF(7f), 
and  suppose  V(U)  >  0.  Then  since  DISFQ'H])  =  DISF("H)  implies 

M^KC;*))  =  Bem(CJ)]Vu)VvW) 

=  D  ,([n}(C/V(U)-J,Vu)]Vu)Vv{U): 


a  little  algebra  reveals  that  for  m>2V(U)  , 


U'e(n,(;VxY,m,s)  <  32>V{U)Ut{U,<jV{U)\Vu,  \{l/2)V{U)m\s) 


(12.57) 


In  particular,  for  j  >  jg,  taking  %  =  Tn  we  have  (from  the  definition  of  Tf)  U  =  DISF (TL)  = 
DISCH)  =  Uj,  so  that  when  V{Uf)  >  0,  any 


m 


>  2 T(Uf)~xMg  ( 


„  /  O-J-l  o2-j 

-1a/t  .  z  z 


Fj,'PUj,s{2m) 


33V {UjY  V{Uj) 

suffices  to  make  the  right  side  of  (12.57)  (with  s  =  s(2 m)  and  £  =  22~-?)  at  most  2_j  l;  in 
particular,  this  means  taking  Uj  equal  to  2 m  V  Uj-  ]  V  2 Mj_2  for  any  such  m  (with  log2(m)  G 
N)  suffices  to  satisfy  (12.12)  (with  the  in  (12.12)  defined  with  respect  to  the  fa  function); 
monotonicity  of  (  G  (C>  ^^;^,'P«,-,s(2m)^  implies  (12.14)  is  a  sufficient  condition 
for  this.  In  the  special  case  where  V(Uj)  =  0,  22_J;  Vxy,  m,  s)  =  K^,  so  that  taking 


u. 


>  KEs(uj)2:>+2\/Uj-i\/2uj-i  suffices  to  satisfy  (12.12)  (again,  with  the  in  (12.12)  defined 


in  terms  of  fa).  Plugging  these  values  into  Theorem  12.7  completes  the  proof.  □ 

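Proof of Theorem 12.16. Let $j_{\varepsilon} = \lceil \log_2(1/\Psi_{\ell}(\varepsilon)) \rceil$. For $j_{\ell} \le j \le j_{\varepsilon}$, let $s_j = \mathrm{Log}\left(\frac{48(2+j_{\varepsilon}-j)^2}{\delta}\right)$
and define $u_j = 2^{\lceil \log_2(u'_j) \rceil}$, where
\[
u'_j = c'\left(b2^{j(2-\beta)} + \bar{\ell}2^{j}\right)\left(\mathrm{vc}(\mathcal{G}_{\mathcal{F}})\mathrm{Log}(\chi_{\ell}\bar{\ell}) + s_j\right), \tag{12.58}
\]
for an appropriate universal constant $c' \in [1,\infty)$. A bit of calculus reveals that for $j_{\ell}+2 \le j \le j_{\varepsilon}$,
$u'_j \ge u'_{j-1}$ and $u'_j \ge 2u'_{j-2}$, so that $u_j \ge u_{j-1}$ and $u_j \ge 2u_{j-2}$ as well; this is also
trivially satisfied for $j \in \{j_{\ell}, j_{\ell}+1\}$ if we take $u_{j-2} = 1$ in these cases (as in Theorem 12.7).
Combining this fact with (12.31), (12.8), and (12.9), we find that, for an appropriate choice of
the constant $c'$, these $u_j$ satisfy (12.12) when we define $\hat{s}$ such that, for every $j \in \{j_{\ell}, \ldots, j_{\varepsilon}\}$,
$\forall m \in \{2u_{j-1}, \ldots, u_j\}$ with $\log_2(m) \in \mathbb{N}$,
\[
\hat{s}(m) = \mathrm{Log}\left(\frac{12\log_2(4u_j/m)^2(2+j_{\varepsilon}-j)^2}{\delta}\right).
\]
Additionally, let $s = \log_2(2/\delta)$.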
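
The construction of the schedule $u_j$ can be sketched numerically as follows. This is only an illustration under stated assumptions: $\mathrm{Log}(x)$ is taken to mean $\max\{\ln x, 1\}$, the complexity term $\mathrm{vc}(\mathcal{G}_{\mathcal{F}})\mathrm{Log}(\chi_{\ell}\bar{\ell})$ is passed in as a single number, and the parameter values and the constant `c_prime = 1` are arbitrary placeholders rather than the constants of the theorem.

```python
import math


def Log(x: float) -> float:
    """Truncated logarithm max(ln x, 1); assumed meaning of 'Log' in the text."""
    return max(math.log(x), 1.0)


def schedule(j_lo, j_eps, b, beta, ell_bar, complexity, delta, c_prime=1.0):
    """Return {j: u_j} with u_j the smallest power of two >= u'_j from (12.58)."""
    sched = {}
    for j in range(j_lo, j_eps + 1):
        s_j = Log(48.0 * (2 + j_eps - j) ** 2 / delta)
        u_prime = c_prime * (b * 2.0 ** (j * (2.0 - beta)) + ell_bar * 2.0**j) * (complexity + s_j)
        sched[j] = 2 ** math.ceil(math.log2(u_prime))
    return sched


# Placeholder parameters chosen only for illustration.
sched = schedule(j_lo=0, j_eps=10, b=1.0, beta=1.0, ell_bar=1.0, complexity=5.0, delta=0.05)
for j, u_j in sched.items():
    print(j, u_j)
```

For these placeholder values the rounded schedule is nondecreasing and satisfies $u_j \ge 2u_{j-2}$, the two monotonicity properties the proof relies on; rounding up to a power of two preserves both whenever they hold for $u'_j$.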


Next,  note  that,  since  ^g{e)  <  Tg (e)  and  Uj  is  nondecreasing  in  j, 

uje  <  u-je  <  26 d  +  ^7) )  (vc  (&0  L°S  M  +  Log(l/<5))  , 

so  that,  for  any  c  >  26c',  we  have  u  >  uis,  as  required  by  Theorem  12.7. 

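For $\mathcal{U}_j$ as in Theorem 12.7, note that by Condition 12.10 and the definition of $\theta$,
\[
\mathcal{P}(\mathcal{U}_j) = \mathcal{P}\left(\mathrm{DIS}\left(\mathcal{F}\left(\mathcal{E}_{\ell}(2^{2-j}); 01\right)\right)\right) \le \mathcal{P}\left(\mathrm{DIS}\left(\mathrm{B}\left(h^*, a\mathcal{E}_{\ell}(2^{2-j})^{\alpha}\right)\right)\right)
\le \theta\max\left\{a\mathcal{E}_{\ell}(2^{2-j})^{\alpha}, a\varepsilon^{\alpha}\right\} \le \theta\max\left\{a\Psi_{\ell}^{-1}(2^{2-j})^{\alpha}, a\varepsilon^{\alpha}\right\}.
\]
Because $\Psi_{\ell}$ is strictly increasing on $(0,1)$, for $j \le j_{\varepsilon}$, $\Psi_{\ell}^{-1}(2^{2-j}) \ge \varepsilon$, so that this last expression
is equal to $\theta a\Psi_{\ell}^{-1}(2^{2-j})^{\alpha}$. This implies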

E  t>  («j)  %  <  E  v  («j 


Eo.W,-1  (22-')“  (62«2-»  +  «■')  (A!  +  Log  (2  -  j.  -  j))  . 


(12.59) 


We can change the order of summation in the above expression by letting $i = j_{\varepsilon} - j$ and summing
from $0$ to $N = j_{\varepsilon} - j_{\ell}$. In particular, since $2^{j_{\varepsilon}} \le 2/\Psi_{\ell}(\varepsilon)$, (12.59) is at most

JX  /  \  a  /  Ah2i^~2)  2f2~l\ 

?  a9^1  (22^20  +  )  (Al  +  Log(^  +  2))  •  (12-60) 
Since $x \mapsto \Psi_{\ell}^{-1}(x)/x$ is nonincreasing on $(0,\infty)$, $\Psi_{\ell}^{-1}(2^{2-j_{\varepsilon}}2^{i}) \le 2^{i+2}\Psi_{\ell}^{-1}(2^{-j_{\varepsilon}})$, and
since $\Psi_{\ell}^{-1}$ is increasing, this latter expression is at most $2^{i+2}\Psi_{\ell}^{-1}(\Psi_{\ell}(\varepsilon)) = 2^{i+2}\varepsilon$. Thus, (12.60)
is at most


N 


16 a6ea  Yj 


i= 0 


^2*0+/3-2)  ^2*(a~1) 

+  ' 


(A\  +  Log(i  +  2)) . 


(12.61) 


^(e)2-/3  '  ^(e) 

In  general,  Log(i  +  2)  <  Log(N  +  2),  so  that  E*Lo  2^Q+/3-2)  (A1  +  Log()  +  2))  <  ( A-L  + 
Log()V+2))()V+l)  and  £  f=0  2i(“"1)  (A,  +  Log (i  +  2))  <  (A1+Log(lV+2))(iV+l).  When  a+ 

P  <  2,  we  also  have  E,=o  2^+^2)  <  2^+^2)  =  1_2(a1+,_2)  and  E^o  2^-2)Log(i  + 

2)  <  E“o2i(Q+/3'2)L°g(i+2)  <  l=^^2)L°g  Similarly,  if  a  <  1,  Eio  2<(“_1)  < 

E“o2*(a_1)  =  i— 2(^—1)  and  likewise  Eio  2i(a_1)Log(i  +  2)  <  E“o  2i("_1)Log()  +  2)  < 
1_2(2a_i)  Log  ^  j  •  By  combining  these  observations  (along  with  a  convention  that  1_2(Eo  = 

oo  when  a  —  1,  and  1_2(ct1+/3-2)  =  oo  when  a  =  /3  =  1),  we  find  that  (12.61)  is 


<  a6e° 


b{Ai  +  Log  (B^Bi  +AAi  +  Log(C'1))C1 


Thus, for an appropriately large numerical constant $c$, any $n$ satisfying (12.33) has
\[
n \ge s + 2e\sum_{j=j_{\ell}}^{j_{\varepsilon}} \mathcal{P}(\mathcal{U}_j)u_j,
\]
as required by Theorem 12.7.
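
The bounds on the sums over $i$ used above rest on two elementary facts about truncated geometric series with ratio $2^{\alpha-1} \le 1$: the sum of the first $N+1$ terms is at most $N+1$, and when $\alpha < 1$ it is also at most $1/(1-2^{\alpha-1})$. The sketch below checks the minimum of the two numerically; the function names are hypothetical, and the logarithmic factors $A_1 + \mathrm{Log}(i+2)$ from the text are omitted for simplicity.

```python
def truncated_geometric(alpha: float, N: int) -> float:
    """Sum_{i=0}^{N} 2^{i(alpha - 1)}."""
    return sum(2.0 ** (i * (alpha - 1.0)) for i in range(N + 1))


def geometric_bound(alpha: float, N: int) -> float:
    """min(N + 1, 1 / (1 - 2^(alpha - 1))), with the second term infinite at alpha = 1."""
    if alpha >= 1.0:
        return float(N + 1)
    return min(float(N + 1), 1.0 / (1.0 - 2.0 ** (alpha - 1.0)))


checks = [
    (alpha, N, truncated_geometric(alpha, N), geometric_bound(alpha, N))
    for alpha in (0.0, 0.5, 0.9, 1.0)
    for N in (0, 5, 40)
]
for alpha, N, total, bound in checks:
    print(f"alpha={alpha}, N={N}: sum={total:.4f} <= bound={bound:.4f}")
```

This is the reason the constants in the theorem statements take the form of a minimum between a closed-form geometric bound and a logarithmic term: the former is better when $\alpha$ is bounded away from 1, the latter when $\alpha$ is close to or equal to 1.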

Finally, we need to show the success probability from Theorem 12.7 is at least $1-\delta$, for $\hat{s}$
and $s$ as above. Toward this end, note that


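\begin{align*}
\sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6e^{-\hat{s}(2^i)}
&\le \sum_{j=j_{\ell}}^{j_{\varepsilon}} \sum_{i=\log_2(u_{j-1})+1}^{\log_2(u_j)} \frac{\delta}{2(2+\log_2(u_j)-i)^2(2+j_{\varepsilon}-j)^2} \\
&= \sum_{j=j_{\ell}}^{j_{\varepsilon}} \sum_{t=0}^{\log_2(u_j/u_{j-1})-1} \frac{\delta}{2(2+t)^2(2+j_{\varepsilon}-j)^2} \\
&< \sum_{j=j_{\ell}}^{j_{\varepsilon}} \frac{\delta}{2(2+j_{\varepsilon}-j)^2}
< \sum_{t=0}^{\infty} \frac{\delta}{2(2+t)^2}
< \delta/2.
\end{align*}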
Noting that $2^{-s} = \delta/2$, we find that indeed
\[
1 - 2^{-s} - \sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6e^{-\hat{s}(2^i)} \ge 1 - \delta.
\]
Therefore, Theorem 12.7 implies the stated result. □
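
The failure-probability accounting in this proof reduces to the fact that $\sum_{t \ge 0} (2+t)^{-2} = \pi^2/6 - 1 < 1$. The following sketch evaluates the double sum for an arbitrary number of stages and rounds per stage; the data layout (`rounds_per_stage[k]` is the number of dyadic rounds in the stage with $j_{\varepsilon} - j = k$) is an assumption made for this illustration.

```python
import math


def failure_mass(delta: float, rounds_per_stage) -> float:
    """Sum over stages k and rounds t of delta / (2 (2 + t)^2 (2 + k)^2)."""
    total = 0.0
    for k, rounds in enumerate(rounds_per_stage):
        for t in range(rounds):
            total += delta / (2.0 * (2 + t) ** 2 * (2 + k) ** 2)
    return total


delta = 0.1
mass = failure_mass(delta, [50] * 50)     # 50 stages, 50 rounds each
chernoff_part = 2.0 ** (-math.log2(2.0 / delta))  # 2^{-s} with s = log2(2 / delta)
print(mass, chernoff_part, mass + chernoff_part)
```

However many stages and rounds are used, the first term stays below $\delta/2$, and the Chernoff term $2^{-s}$ equals $\delta/2$, so the two failure probabilities together stay below $\delta$.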


Proof  Sketch  of  Theorem  12.17.  The  proof  follows  analogously  to  that  of  Theorem  12.16,  with 
the  exception  that  now,  for  each  integer  j  with  jf  <  j  <  je,  we  replace  the  definition  of  u’j  from 
(12.58)  with  the  following  definition.  Letting  c3  =  vc(f/x)Log  ^  (1/ b)  (aO 2j'5/J1(22~j)a)13^ , 
define 

u'j  =  d  (a0^f\ 22~j)a)1~f3  +  & 7 )  (Cj  +  Sj) , 


where $c' \in [1,\infty)$ is an appropriate universal constant, and $s_j$ is as in the proof of Theorem 12.16.
With this substitution in place, the values $u_j$ and $s$, and function $\hat{s}$, are then defined as in the proof
of Theorem 12.16. Since $x \mapsto x\Psi_{\ell}^{-1}(1/x)$ is nondecreasing, a bit of calculus reveals $u_j \ge u_{j-1}$
and $u_j \ge 2u_{j-2}$. Combined with (12.35), (12.9), (12.8), and Lemma 12.12, this implies we can
choose the constant $c'$ so that these $u_j$ satisfy (12.14). By an identical argument to that used in
Theorem 12.16, we have

\[
1 - 2^{-s} - \sum_{i=1}^{\log_2(u_{j_\varepsilon})} 6e^{-\hat{s}(2^i)} \ge 1 - \delta.
\]

It  remains  only  to  show  that  any  values  of  u  and  n  satisfying  (12.36)  and  (12.37),  respectively, 
necessarily  also  satisfy  the  respective  conditions  for  u  and  n  in  Corollary  12.9. 

Toward this end, note that since $x \mapsto x\Psi_{\ell}^{-1}(1/x)$ is nondecreasing on $(0,\infty)$, we have that


UT  <  U]e  Z 


6(afcQ)1_/3  i  \ 

+  J  2 


Thus, for an appropriate choice of $c$, any $u$ satisfying (12.36) has $u \ge u_{j_\varepsilon}$, as required by Corollary 12.9.

Finally, note that for $\mathcal{U}_j$ as in Theorem 12.7, and $i_j = j_{\varepsilon} - j$,


22~T% 

3  =3 1  3=3 1 

<  b  (aB2^rt'(22-irf-S  (A2  +  Log  ft  +  2)) 

5=5 1 

je 

+  ia92j^J1(22~j)a  (A2  +  Log  {i3  +  2)) . 

3=31 


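By changing the order of summation, now summing over values of $i_j$ from $0$ to $N = j_{\varepsilon} - j_{\ell} \le
\log_2(4\bar{\ell}/\Psi_{\ell}(\varepsilon))$, and noting $2^{j_{\varepsilon}} \le 2/\Psi_{\ell}(\varepsilon)$, and $\Psi_{\ell}^{-1}(2^{-j_{\varepsilon}}2^{2+i}) \le 2^{2+i}\varepsilon$ for $i \ge 0$, this last
expression is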


/  a62i('a~v>  ea\ 

V  Me)  ) 


2-/3 

(A2  +  Log  (i  +  2)) 


ta6  2*(«-1)£« 

^t(e) 


(A2  +  Log  {i  +  2)) . 


(12.62) 


Considering  these  sums  separately,  we  have  YZLo  2**'°  1)(2  ^(2l2  +  Log(i  +  2))  <  (iV  +  l)(A2  + 
Log(iV  +  2))  and  YZZo  2i(q_1)(A2  +  Log()  +  2))  <  (N  +  1)(A2  +  Log(iV  +  2)).  When  a  <  1, 
we  also  have  YliLo  2t('a~1')(2~l3\A2  +  Log(i  +  2))  <  2lA-l)(2-P)(A2  +  Log(i  +  2))  < 

1_2(Q-1)(2-/ 3)  Log  ( ! _ 2(a—i)(2—/3) )  +  1_2ca-1i)(2-^)  M,  and  similarly  E*=0  2<(a-1)(42  +  Log(i  +  2))  < 

l_ja-i)A2  +  I=J^u Log  (1_2(1a_1)).  Thus,  generally  ^=o  2*(a-1)(2-^(g42  +  Log(i  +  2))  < 
B2(A2  +  Log(£2))  and  2*(a_1)(A2  +  Log(i  +  2))  <  C2(A2  +  Log(C2)).  Plugging  this  into 
(12.62),  we  find  that  for  an  appropriately  large  numerical  constant  c,  any  n  satisfying  (12.37)  has 
n  >  YZj=jt  'P(Mj)uj’  as  required  by  Corollary  12.9.  □ 


12.7  Results  for  Efficiently  Computable  Updates 

Here we include more detailed sketches of the arguments leading to computationally efficient
variants of Algorithm 1, for which the specific results proven above for the given applications
remain valid. Throughout this section, we adopt the notational conventions introduced in the
proof of Theorem 12.7 (e.g., $V^{(m)}$, $\tilde{V}^{(m)}$, $Q_m$, $\mathcal{L}_m$, $S$), except in each instance here these are
defined in the context of applying Algorithm 1 with the respective stated variant of $\hat{T}_{\ell}$.


12.7.1  Proof  of  Theorem  12.16  under  (12.34) 

We begin with the application to VC Subgraph classes, first showing that if we specify $\hat{T}_{\ell}(V; Q, m)$
as in (12.34), the conclusions of Theorem 12.16 remain valid. Fix any $\hat{s}$ function (to be specified
below), and fix any value of $\varepsilon \in (0,1)$. First note that, for any $m$ with $\log_2(m) \in \mathbb{N}$, by a Chernoff
bound and the law of total probability, on an event $E''_m$ of probability at least $1 - 2^{1-\hat{s}(m)}$, if
$m \in S$, then
\[
(1/2)m\mathcal{P}(D_m) - \sqrt{\hat{s}(m)m\mathcal{P}(D_m)} \le |Q_m| \le \hat{s}(m) + em\mathcal{P}(D_m). \tag{12.63}
\]
Also recall that, for any $m$ with $\log_2(m) \in \mathbb{N}$, by Lemma 12.4 and the law of total probability,
on an event $E_m$ of probability at least $1 - 6e^{-\hat{s}(m)}$, if $m \in S$ and $h^* \in V^{(m)}$, then

(|Qm|  V  1)  f  R^(/i*;  Qrn)  mf  R^fisQm)] 

V  g£V( m)  ) 

{h* ;  Cm)  -  inf  R*  (gDrn ;  Cm ) 

<  J&e  (y™;VXY,m/2,s(m))  (12.64) 

and  \/h  E  V^m\ 

777 

-  (R e(hDm)  -  R e(h')) 

<  ^  (r e(hDm-,Cm)  -  R  e(h*;  Cm)  +  Ut  (v^-,VXY,  m/2,  s(m))  A  tj 

=  \Qm\  (R e{h]  Qm )  -  'Rt(h*i  Qm))  +  j  (ue  (vj£-,VXY,  m/2,  s(m))  A  tj 

<  ( \Qm\  V  l)Te  (vW-,Qm,rn)  +  j  (Ue  (v^-VXY,m/2,5(m)^  A  tj  .  (12.65) 

Fix a value $i_{\varepsilon} \in \mathbb{N}$ (an appropriate value for which will be determined below), and let $\chi_{\ell} =$

For  rn  G  N  with  log2(m)  G  N,  let 

T£(m)  =  c2  (vc(^jr)Log(x/)  +s(m))^ 

£ 

+  c2—  (vc(^jr)Log(x^)  +  s(m))  , 
m 

for  an  appropriate  universal  constant  c2  G  [1,  oo)  (to  be  determined  below);  for  completeness, 
also  define  T)(l)  =  £.  We  will  now  prove  by  induction  that,  for  an  appropriate  value  of  the 
constant  c0  in  (12.34),  for  any  ml  with  log2(m')  G  {1, . . . ,  ie},  on  the  event  )_1  E2i  D 

£"+1,  if  ml  G  S,  then  h*  G  V^m'\ 

V£nj  C  [J](7m//2;  £)  C  [J-](27Km'/2)  V 

c  J-(£,(7m72);oi)  C  E(8.£{2T£(m'/2)  V^(e));oi), 

^  (^?;^y,m72,5(m'))  Af  <  Qm>,  m')  A  f )  , 

and  if  7m//2  > 

1  A£)  <  f£(m'). 

As  a  base  case  for  this  inductive  argument,  we  note  that  for  ml  =  2,  we  have  (by  definition) 
7m'/2  =  7  and  furthermore  (if  c0  A  c2  >  2)  tt(V(2h  Q2,  2)  >  £  and  f)(l)  >  £,  so  that  the 
claimed  inclusions  and  inequalities  trivially  hold.  Now,  for  the  inductive  step,  take  as  an  inductive 
hypothesis  that  the  claim  is  satisfied  for  ml  =  m  for  some  rn  G  N  with  log2(m)  G  {1, . . . ,  ie  —  1}. 
Suppose  the  event  H  E”i+1  occurs,  and  that  2m  G  S.  By  the  inductive  hypothesis, 

combined  with  (12.64)  and  the  fact  that  (\Qm\  V  1)R e(h*;  Qrn)  <  (m/ 2)1,  we  have 

(\Qrn\  V  1)  (  R£(h*-,Qm)  -  inf  R£(g;  Qm) ) 

V  g£V(™)  ) 

<  y  (Ut  {v^-VXY,m/2,s(m^  A  t)  <  (\Qm\  V  1  )Te  (V^;Qm,m)  . 
Therefore,  h*  E  as  well,  which  implies  h*  E  =  Vr(m) .  Furthermore,  by  (12.65),  the 

inductive  hypothesis,  and  the  definition  of  V(m>  from  Step  6,  V/i  E  V(2ni'1  =  V^m\ 

R i{hDm)  ~  Mh*)  <  2IQ/^1  (ft  (V(m);  Qm,rn)  A  t)  , 

and  if  7 m/2  >  46(e),  then  this  is  at  most  2T)(m). 

Since  7m  =  2^j^  ^  (l/M;  Qm,  m)  Af),  and  R*(/ir>2m)  <  R<(^aJ  for  every  h  e 
R(2m)d,  we  have  C  [Jr](7m;^)  C  [Jr](2T)(m)  V  46(e);  £).  By  definition  of  £?(•),  we 

also  have  er (hD2rri)  —  er (h*)  <  £^(7 m)  for  every  h  E  since  h*  E  v(2m\  we  have 

sign (JiD2m)  =  sign(/i),  so  thater(/i)—  er(h*)  <  ££(7™)  as  well:  that  is,  C(2m)  C  .F(£^(7m);  01)  C 
.F(£^(2 Te(m)  V  46(e));  01).  Combining  these  facts  with  (12.5),  (12.25),  Condition  12.11,  mono¬ 
tonicity  of  vc (Ghu)  in  both  U  and  H,  and  the  fact  that  \\F(Qv(2m)  ,p  )\\vXY  —  we 

have  that 


-Cl\ 


by 


p  vc (^) Log  +  5 (2m) 


m 


+  Cl 


_vc((yjr)Log  (gP(D^2m))  +  s(2m) 

n  \  vim  / 


m 


,  (12.66) 


for  some  universal  constant  <7  e  [1,  00).  By  (12.63),  we  have  V(D2m)  <  p-(|Q2m|  +  s(2m)),  so 
that  the  right  hand  side  of  (12.66)  is  at  most 


Cl 


\ 


by„ 


vc(g7Log(g^f^)+i(2m) 


m 


+  Cl 


,vc(^)Log(^'^fm)))+^(2") 

m 


<  8ci 


\ 


by 


gvc(g7Log(f»'7:;^"‘»)+a(2,n) 


2  m 


8ci£ 


vc(^)Log(^j^»)+i(2m) 
2m 


Thus,  if  we  take  Co  =  8ci  in  the  definition  of  7)  in  (12.34),  then  we  have 

^  (VD2hiVXY,  m,  5(2  m))  A  £  <  |g2^Vl  (l)  (C^;  Q2m,  2  m)  A  f)  . 
Furthermore,  (12.63)  implies  \Q2m\  <  s(2m)+2em'P(-D2m).  In  particular,  if  s (2m)  >  2emV(D2m),
then 

V  1  (f,  Q2m,  2m)  A  l)  <  i(2m)  + 

m  V  v  '  )  m  m 

and  taking  any  c2  >  4  guarantees  this  last  quantity  is  at  most  T)(2m).  On  the  other  hand, 

if  s(2m)  <  2emV(D2m),  then  |Q2m|  <  4 emV(D2rn),  and  we  have  already  established  that 

^y(2m)  g  Jr(££(7m);  oi),  so  that 


|g2m|  V  1  (fe  {V{2m);Q2m,  2m)  A  / 


<  8ci 


\ 


VC(^)Log  (^(DIS(^(£A7m);oi)))\  +  £(2m) 

/3  \  ^7m  / 


2m 


+  8cyO 


VC(^)Log  +5(2m) 


_&7jt 

2  m 


(12.67) 


If  7m  >  ^(e),  then  this  is  at  most 


o„  i  ,  i  ut,P  vc(^)L°g  (3exr^)  +  S (2m)  |  _vc(^)Log  (3ex/)  +  s{2 m) 

oCi  I  t  /  Oym  ^  + 


2  m 


2m 


<  48ci  I  \llryP  vc(^)Log  +  *(2m)  +  _jVc(^)Log  (x^)  +  s(2m) 


2m 


2m 


For  brevity,  let  K  =  vc(g^)Lo^*^)+s(2m)  .  argUCt|  above?  <  2 T^(m),  so  that  the  right  hand 
side  of  the  above  inequality  is  at  most 

48\/2ci  (^JbfE{mYK  +  £K^j  . 

Then  since  s(m)  <  2s (2m),  the  above  expression  is  at  most 

(  1  (12.68) 


48  ■  4ci  I  \l  b  [  ( bK )  p?  V  £K  J  K  +  IK  )  . 


If  tK  <  ( bK)2-P ,  then  (12.68)  is  equal 


48  •  4civ^  {ibK)^P  +  £K^j  . 
On  the  other  hand,  if  £K  >  (bK)  ,  then  (12.68)  is  equal


48  •  4civ^  ^\JbK(£K)P  +  £K^j 

<  48  •  4ciy/C2  (^J \tK)2~P{lKy  +  £K^j  =  48  •  8 c^lK. 

In  all  of  the  above  cases,  taking  c2  =  9  •  2l4cy  in  the  dehnition  of  X  yields 

|g2^Vl  (t,  (U(2m);  Q2m,  2m)  A  tj  <  7} (2m). 

This  completes  the  inductive  step,  so  that  we  have  proven  that  the  claim  holds  for  all  m!  with 

log2(m')  G  {l,...,ij. 

Let  jg  =  -flog2(£)],  j£  =  |log2(l/^(e))l,  and  for  each  j  G  {je, . . . ,  je},  let  sj  = 
log2  ^144(2+L-1)  ^  define 

m'j  =  32 c2  (62j(2"/3)  +  £2J)  (vc(^)Log(y/)  +  sj)  , 


and  let  nij  =  2^I°S2('A-|T .  Also  define  m]e-i  =  1.  Using  this  notation,  we  can  now  define 
the  relevant  values  of  the  s  function  as  follows.  For  each  j  G  {je, . . . ,  j£},  and  each  m  G 

{rrij- 1  +  1 .  ,rrij}  with  log2(m)  G  N,  define 

s(m)  =  log2  ( 161°g2(4mj/my(2  +  j,-j)n 


In  particular,  taking  X  =  log2  (mjJ,  we  have  that  2X^(2*®  4)  <  ^(e),  so  that  on  the  event 
DS1  x2i  n  X"+1,  if  we  have  2<e  G  X,  then  h  G  U(2i£)  C  X’(£^(27K2^-1)  V  ^(e));0i)  = 
X’(£^(^(£));  oi)  C  X'(4/71('k£(e));  oi)  =  X"(£;  oi),  so  that  er (h)  —  er (h*)  <  e. 

Furthermore,  we  established  above  that,  on  the  event  fX^1  E2t  fl  X"+1,  for  every  j  G 
{je, ...  ,j£}  with  nij  G  S,  and  every  m  G  {mj_ i  +  1 , . . . ,  rrij }  with  log2(m)  G  N,  U(m-)  C 
Jr{8.e{2Te{m/2)  V  ^(e));oi)  C  Jr(£€(2X£(mj_i)  V  ^(e));oi).  Noting  that  2X)(mi_i)  <  21_X 
we  have 

je  mj 


\Qm\  < 


IdIS(J7(£^(21-J);oi))  (*m)- 


m£S:m<m~  j=ji  m=rrij— i+l 
A Chernoff bound implies that, on an event $E'$ of probability at least $1 - \delta/2$, the right hand side
of  the  above  inequality  is  at  most 


Je 


log2(2/5)  +  2e^K-  -mj_1)P(DIS(J-(£,(21^);oi))) 

3=3t 

je 

<  log2(2 /S)  +  2e  ^  „,))). 

3=U 

By  essentially  the  same  reasoning  used  in  the  proof  of  Theorem  12.16,  the  right  hand  side  of  this 
inequality  is 

<  a0£a  ( +  Log(-Bi))i?i  +  Log(Ci))Ci 


^e(£)2-P 


Since 


my 


< 


+ 


®e(e) 


Ai, 


\  Ve(£)2-e 

the  conditions  on  u  and  n  stated  in  Theorem  12.16  (with  an  appropriate  constant  c)  suffice  to 
guarantee  er(fi)  —  er(fi*)  <  £  on  the  event  EJ  n  f'/i , 1  E-r  D  E"t+1 .  Finally,  the  proof  is  completed 
by  noting  that  a  union  bound  implies  the  event  E'  D  f^-E, 1  E2,  D  E”,+l  has  probability  at  least 

is— 1 


l---^21"^+1)  +  6e 


-S(2*) 


i=  1 


> 


> 


5 


je  l°g2  K) 


E 

3=31  i=log2(mj_i)+l 

r  je  oo 

i-  -EE 


5 


2(2  +  log2(mj)  -  i)2( 2  +  j£  -  j )2 

5 


2  W  +  WP+je-jY 


5 


j=ji  fc=o 

je 


-E 


5 


5 


2  2(2  +  j£  —  j)2 


-E 


5 


2  ^2(2  +  t) 


2  — 


>1-5. 


Note that, as in Theorem 12.16, the function $s$ in this proof has a direct dependence on $a$,
$\alpha$, and $\chi_\ell$, in addition to $b$ and $\beta$. As before, with an alternative definition of $s$, similar to that
mentioned in the discussion following Theorem 12.16, it is possible to remove this dependence,
at the expense of the same logarithmic factors mentioned above.
Chapter 13


Online  Allocation  and  Pricing  with 
Economies  of  Scale 


Abstract 

Allocating multiple goods to customers in a way that maximizes some desired objective is a
fundamental  part  of  Algorithmic  Mechanism  Design.  We  consider  here  the  problem  of  offline 
and  online  allocation  of  goods  that  have  economies  of  scale,  or  decreasing  marginal  cost  per  item 
for  the  seller.  In  particular,  we  analyze  the  case  where  customers  have  unit-demand  and  arrive 
one  at  a  time  with  valuations  on  items,  sampled  iid  from  some  unknown  underlying  distribution 
over  valuations.  Our  strategy  operates  by  using  an  initial  sample  to  learn  enough  about  the 
distribution  to  determine  how  best  to  allocate  to  future  customers,  together  with  an  analysis  of 
structural  properties  of  optimal  solutions  that  allow  for  uniform  convergence  analysis.  We  show, 
for  instance,  if  customers  have  binary  valuations  over  items,  and  the  goal  of  the  allocator  is  to 
give  each  customer  an  item  he  or  she  values,  we  can  efficiently  produce  such  an  allocation  with 
cost  at  most  a  constant  factor  greater  than  the  minimum  over  such  allocations  in  hindsight,  so 
long as the marginal costs do not decrease too rapidly. We also give a bicriteria approximation
to social welfare for the case of more general valuation functions when the allocator is budget
constrained.

This chapter is based on joint work with Avrim Blum and Yishay Mansour.


13.1  Introduction 

Imagine  it  is  the  Christmas  season,  and  Santa  Claus  is  tasked  with  allocating  toys.  There  is  a 
sequence  of  children  coming  up  with  their  Christmas  lists  of  toys  they  want.  Santa  wants  to  give 
each  child  some  toy  from  his  or  her  list  (for  simplicity,  assume  all  children  have  been  good  this 
year).  But  of  course,  even  Santa  Claus  has  to  be  cost-conscious,  so  he  wants  to  perform  this 
allocation  of  toys  to  children  at  a  near-minimum  cost  to  himself  (call  this  the  Thrifty  Santa  Claus 
Problem). Now, if it were the case that every toy had a fixed price, this would be easy: simply
allocate  to  each  child  the  cheapest  toy  on  his  or  her  list  and  move  on  to  the  next  child.  But  here 
we  are  interested  in  the  case  where  goods  have  economies  of  scale.  For  example,  producing  a 
million toy cars might be cheaper than a million times the cost of producing one toy car. Thus,
even  if  producing  a  single  toy  car  is  more  expensive  than  a  single  Elmo  doll,  if  a  much  larger 
number  of  children  want  the  toy  car  than  the  Elmo  doll,  the  minimum-cost  allocation  might  give 
toy  cars  to  many  children,  even  if  some  of  them  also  have  the  Elmo  doll  on  their  lists. 

The  problem  faced  by  Santa  (or  by  any  allocator  that  must  satisfy  a  collection  of  disjunctive 
constraints  in  the  presence  of  economies  of  scale)  makes  sense  in  both  offline  and  online  settings. 
In  the  offline  setting,  in  the  extreme  case  of  goods  such  as  software  where  all  the  cost  is  in  the  first 
copy, this is simply weighted set-cover, admitting a $\Theta(\log n)$ approximation to the minimum-cost
allocation.  We  will  be  interested  in  the  online  case  where  customers  are  iid  samples  from  some 
arbitrary distribution over subsets of item-set $\mathcal{I}$ (i.e., Christmas lists), where the allocator must
make  allocation  decisions  online,  and  where  the  marginal  cost  of  goods  does  not  decrease  so 
sharply.  We  show  that  for  a  range  of  cost  curves,  including  the  case  that  the  marginal  cost  of  copy 
$t$ of an item is $t^{-\alpha}$, for some $\alpha \in [0, 1)$, we will be able to get a constant-factor approximation so
long  as  the  number  of  customers  is  sufficiently  large  compared  to  the  number  of  items. 
One basic observation we show is that, if the marginal costs are non-increasing, there is always
an optimal allocation that can be described as an ordering of the possible toys, so that as
each child comes, Santa simply gives the child the first toy in the ordering that appears on the
child's list. Another observation we prove is that, if the marginal costs do not drop too quickly,
then if we are given the lists of all the children before determining the allocation, we can efficiently
find an allocation that is within a constant factor of the minimum-cost allocation, as opposed
to the logarithmic factor required for the set-cover problem. Since, however, the problem
we are interested in does not supply the lists before the allocations, but rather requires a decision
for each child in sequence, we rely on the iid assumption and use ideas from machine learning,
as follows: after processing a small initial number of children (with no nontrivial guarantees on
allocation costs for these), we take their wish lists as representative of the future children, and
find the optimal solution (in hindsight) for those, while treating each of these children as representing
many future children (supposing we know the total number of children ahead of time).
We then take the ordered list of toys from this solution, and allocate according to this preference
ordering in the future (allocating to each child the earliest toy in the ordering that is also on his
or her list). We show that, as long as we take a sufficiently large number of initial children, this
procedure will find an ordering that will be near-optimal for allocating to the remaining children.

More  generally,  we  can  imagine  the  case  where,  rather  than  simple  lists  of  items,  the  lists 
also  provide  valuations  for  each  item,  and  we  are  interested  in  the  trade-off  between  maximizing 
the  total  of  valuations  for  allocated  items  while  minimizing  the  total  cost  of  the  allocation.  In  this 
case,  we  might  think  of  the  allocator  as  being  a  large  company  with  many  different  projects,  and 
each  project  has  some  valuations  over  different  resources  (e.g.,  types  of  laptops  for  employees 
involved  in  that  project),  where  it  could  use  one  or  another  resource  but  prefers  some  resources 
over others. One natural quantity to consider in this context is the social welfare: the difference
between the happiness (total of valuations for the allocation) and the total cost of the allocation.
In this case, it turns out the optimal allocation rule can be described by a pricing scheme. In
other words, whatever the optimal allocation is, there always exist prices such that if the buyers
purchase  what  they  most  want  at  those  prices,  they  will  actually  produce  that  allocation.  We  note 
that,  algorithmically,  this  is  a  harder  problem  than  the  list-based  problem  (which  corresponds  to 
binary  valuations). 

Aside  from  social  welfare,  it  is  also  interesting  to  consider  a  variant  in  which  we  have  a 
budget  constraint,  and  are  interested  in  maximizing  the  total  valuation  of  the  allocation,  subject 
to  that  budget  constraint  on  the  total  cost  of  the  allocation.  It  turns  out  this  latter  problem  can  be 
reduced  to  a  problem  known  as  the  weighted  budget  maximum  coverage  problem.  Technically, 
this  problem  is  originally  formulated  for  the  case  in  which  the  marginal  cost  of  a  given  item 
drops  to  zero  after  the  first  item  of  that  type  is  allocated  (as  in  the  set  cover  reduction  mentioned 
above);  however,  viewed  appropriately,  we  are  able  to  formulate  this  reduction  for  arbitrary 
decreasing  marginal  cost  functions.  What  we  can  then  do  is  run  an  algorithm  for  the  weighted 
budget  maximum  coverage  problem,  and  then  convert  the  solution  into  a  pricing.  As  before,  this 
strategy  will  be  effective  for  the  offline  problem,  in  which  all  of  the  valuations  are  given  ahead  of 
time.  However,  we  can  extend  it  to  the  online  setting  with  iid  valuation  functions  by  generating 
a  pricing  based  on  an  appropriately-sized  initial  sample  of  valuation  functions,  and  then  apply 
that  pricing  to  sequentially  generate  allocations  for  the  remaining  valuations.  Again,  as  long  as 
the  marginal  costs  are  not  decreasing  too  rapidly,  we  can  obtain  an  allocation  strategy  for  which 
the  sum  of  valuations  of  the  allocated  items  will  be  within  a  constant  factor  of  the  maximum 
possible,  subject  to  the  budget  constraint  on  the  cost. 

13.1.1  Our  Results  and  Techniques 

We  consider  this  problem  under  two,  related,  natural  objectives.  In  the  first  (the  “thrifty  Santa 
Claus”  objective)  we  assume  customers  have  binary  {0, 1}  valuations,  and  the  goal  of  the  seller  is 
to  give  each  customer  a  toy  of  value  1,  but  in  such  a  way  that  minimizes  the  total  cost  to  the  seller. 
We show that so long as the number of buyers $n$ is large compared to the number of items $r$, and
so long as the marginal costs do not decrease too rapidly (e.g., a rate $1/t^{\alpha}$ for some $0 < \alpha < 1$),
we can efficiently perform this allocation task with cost at most a constant factor greater than that
of the optimal allocation of items in hindsight. Note that if costs decrease much more rapidly,
then even if all customers' valuations were known up front, we would be faced with (roughly)
a set-cover problem and so one could not hope to achieve cost $o(\log n)$ times optimal. The
second  objective  we  consider,  which  we  apply  to  customers  of  arbitrary  unit-demand  valuation, 
is  that  of  maximizing  total  social  welfare  of  customers  subject  to  a  cost  bound  on  the  seller;  for 
this,  we  also  give  a  strategy  that  is  constant-competitive  with  respect  to  the  optimal  allocation  in 
hindsight. 

Our algorithms operate by using initial buyers to learn enough about the distribution to determine
how best to allocate to the future buyers. In fact, there are two main technical parts of
our work: the sample complexity and the algorithmic aspects. From the perspective of sample
complexity, one key component of this analysis is examining how complicated the allocation
rule needs to be in order to achieve good performance, because simpler allocation rules require
fewer samples in order to learn. We do this by providing a characterization of what the optimal
strategies look like. For example, for the thrifty Santa Claus version, we show that the
optimal solution can be assumed wlog to have a simple permutation structure. In particular, so
long as the marginal costs are nonincreasing, there is always an optimal strategy in hindsight of
this form: order the items according to some permutation and, for each bidder, give it the earliest
item of its desire in the permutation. This characterization is used inside both our sample
complexity results and our algorithmic guarantees. Specifically, we prove that for cost function
$\mathrm{cost}(t) = \sum_{\tau=1}^{t} 1/\tau^{\alpha}$, for $\alpha \in [0, 1)$, running greedy weighted set cover incurs total cost at
most $\frac{1}{1-\alpha}\mathrm{OPT}$. More generally, if the average cost is within some factor of the marginal cost,
we have a greedy algorithm that achieves constant approximation ratio. To allocate to a new buyer,
we simply give it the earliest item of its desire in the learnt permutation. For the case of
general valuations, we give a characterization showing that the optimal allocation rule in terms
of social welfare can be described by a pricing scheme. That is, there exists a pricing scheme
such that if buyers purchased their preferred item at these prices, the optimal allocation would
result. Algorithmically, we show that we can reduce to a weighted budgeted maximum coverage
problem with single-parameter demand for which there is a known constant-approximation-ratio
algorithm [Khuller, Moss, and Naor, 1999].


13.1.2  Related  Work 

In this work we focus on the case of decreasing marginal cost. There has been a large body
of research devoted to unlimited supply, which is implicitly constant marginal cost (e.g., [Nisan,
Roughgarden, Tardos, and Vazirani, 2007] Chapter 13), where the goal is to achieve a constant
competitive ratio in both offline and online models. The case of increasing marginal cost was
studied in [Blum, Gupta, Mansour, and Sharma, 2011], where constant competitive ratios were
given.

We  analyze  an  online  setting  where  buyers  arrive  one  at  a  time,  sampled  iid  from  some 
unknown  underlying  distribution  over  valuations.  Other  related  online  problems  with  stochastic 
inputs  such  as  matching  problems  have  been  studied  in  ad  auctions  [Goel  and  Mehta,  2008, 
Mehta, Saberi, Vazirani, and Vazirani, 2007]. Algorithmically, our work is related to the online
set cover body of work where [Alon, Awerbuch, Azar, Buchbinder, and Naor, 2009] gave the
first $O(\log m \log n)$ competitive algorithm (here $n$ is the number of elements in the ground set
and $m$ is the size of a family of subsets of the ground set). The problems we study are also related to
online  matching  problems  [Devanur  and  Hayes,  2009,  Devanur  and  Jain,  2012,  Karp,  Vazirani, 
and  Vazirani,  1990]  in  the  iid  setting;  however  our  problem  is  a  bit  like  the  “opposite”  of  online 
matching  in  that  the  cumulative  cost  curve  for  us  is  concave  rather  than  convex. 
13.2 Model, Definitions, and Notation


We have a set $\mathcal{I}$ of $r$ items. We have a set $N = \{1, \ldots, n\}$ indexing $n$ unit demand buyers. Our
setting can then generally be formalized in the following terms.

13.2.1  Utility  Functions 

Each buyer $j \in N$ has a weight $u_{j,i}$ for each item $i \in \mathcal{I}$. We suppose the vectors $u_{j,\cdot}$ are sampled
i.i.d. according to a fixed (but arbitrary and unknown) distribution. In the online setting we are
interested in, the buyers' weight vectors $u_{j,\cdot}$ are observed in sequence, and for each one (before
observing the next) we are required to allocate a set of items $T_j \subseteq \mathcal{I}$ to that buyer. The utility
of buyer $j$ for this allocation is then defined as $u_j(T_j) = \max_{i \in T_j} u_{j,i}$. A few of our results
consider a slight variant of this model, in which we are only required to begin allocating goods
after some initial $o(n)$ number of customers has been observed (to whom we may allocate items
retroactively).

This  general  setting  is  referred  to  as  the  weighted  unit  demand  setting.  We  will  also  be 
interested  in  certain  special  cases  of  this  problem.  In  particular,  many  of  our  results  are  for  the 
uniform unit demand setting, in which every $j \in N$ and $i \in \mathcal{I}$ have $u_{j,i} \in \{0, 1\}$. In this case,
we may refer to the set $S_j = \{i \in \mathcal{I} : u_{j,i} = 1\}$ as the list of items buyer $j$ wants (one of).

13.2.2  Production  cost 

We suppose there are cumulative cost functions $\mathrm{cost}_i : \mathbb{N} \to [0, \infty]$ for each item $i \in \mathcal{I}$, where
for $t \in \mathbb{N}$, the value of $\mathrm{cost}_i(t)$ represents the cost of producing $t$ copies of item $i$. We suppose
each $\mathrm{cost}_i(\cdot)$ is nondecreasing.

We would like to consider the case of decreasing marginal cost, where $t \mapsto \mathrm{cost}_i(t + 1) -
\mathrm{cost}_i(t)$ is nonincreasing for each $i \in \mathcal{I}$.

A natural class of decreasing marginal costs we will be especially interested in are of the
form $t^{-\alpha}$ for $\alpha \in [0, 1)$. That is, $\mathrm{cost}_i(t) = c_0 \sum_{\tau=1}^{t} \tau^{-\alpha}$.
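
To make this cost class concrete, the following is a minimal Python sketch of a cumulative cost of the form above together with its marginal cost. The function names `poly_cost` and `marginal_cost`, the default `c0 = 1`, and the convention that zero copies cost nothing are assumptions made for this illustration only.

```python
def poly_cost(t: int, alpha: float, c0: float = 1.0) -> float:
    """Cumulative cost of t copies: c0 * sum_{tau=1}^{t} tau^(-alpha).

    Assumes t >= 0; zero copies cost nothing (empty sum).
    """
    return c0 * sum(tau ** (-alpha) for tau in range(1, t + 1))


def marginal_cost(t: int, alpha: float, c0: float = 1.0) -> float:
    """Cost of producing one more copy when t copies exist: cost(t+1) - cost(t)."""
    return poly_cost(t + 1, alpha, c0) - poly_cost(t, alpha, c0)


# Marginal costs of the first six copies for alpha = 0.5 (hypothetical choice).
marginals = [marginal_cost(t, alpha=0.5) for t in range(6)]
```

For any `alpha` in [0, 1) the list `marginals` is nonincreasing, which is the decreasing-marginal-cost property assumed throughout; with `alpha = 0` every copy costs the same.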

13.2.3  Allocation  problems 

After processing the $n$ buyers, we will have allocated some set of items $T$, consisting of $\#_i(T) =
\sum_{j \in N} \mathbb{1}_{T_j}(i)$ copies of each item $i \in \mathcal{I}$. We are then interested in two quantities in this setting:
the total (production) cost $\mathrm{cost}(T) = \sum_{i \in \mathcal{I}} \mathrm{cost}_i(\#_i(T))$ and the social welfare $\mathrm{SW}(T) =
\sum_{j \in N} u_j(T_j)$.

We  are  interested  in  several  different  objectives  within  this  setting,  each  of  which  is  some 
variant  representing  the  trade-off  between  reducing  total  production  cost  while  increasing  social 
welfare. 

In the allocate all problem, we have to allocate to each buyer $j \in N$ one item $i \in S_j$ (in the
uniform demand setting): that is, $\mathrm{SW}(T) = n$. The goal is to minimize the total cost $\mathrm{cost}(T)$,
subject to this constraint.

The allocate with budget problem requires our total cost to never exceed a given limit $h$ (i.e.,
$\mathrm{cost}(T) \le h$). Subject to this constraint, our objective is to maximize the social welfare $\mathrm{SW}(T)$.
For instance, in the uniform demand setting, this corresponds to maximizing the number of
satisfied buyers (that get an item from their set $S_j$).

The  objective  in  the  maximize  social  surplus  problem  is  to  maximize  the  difference  of  the 
social  welfare  and  the  total  cost  (i.e.,  SW (T)  —  cost(T)). 
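
The quantities entering these objectives can be computed directly from an allocation. The sketch below is illustrative only: it assumes unit-demand buyers (a bundle is worth its best item, and an empty bundle is worth 0), represents an allocation as a list of item sets, and uses a hypothetical cost curve `poly_cost` with marginal cost $t^{-1/2}$ for every item.

```python
from collections import Counter


def poly_cost(t, alpha=0.5):
    """Assumed cumulative cost curve: sum_{tau=1}^{t} tau^(-alpha)."""
    return sum(tau ** -alpha for tau in range(1, t + 1))


def total_cost(allocation, cost_fns):
    """cost(T): sum over items of cost_i(number of copies of i allocated)."""
    copies = Counter(item for bundle in allocation for item in bundle)
    return sum(cost_fns[item](count) for item, count in copies.items())


def social_welfare(allocation, weights):
    """SW(T): sum over buyers of the best weight in their bundle (0 if empty)."""
    return sum(max((weights[j][i] for i in bundle), default=0.0)
               for j, bundle in enumerate(allocation))


# Three uniform unit-demand buyers over two items (weights[j][i] in {0, 1}).
weights = [[1, 0], [1, 1], [0, 1]]
allocation = [{0}, {0}, {1}]          # buyers 0 and 1 get item 0, buyer 2 gets item 1
cost_fns = [poly_cost, poly_cost]

cost = total_cost(allocation, cost_fns)        # cost_0(2) + cost_1(1)
welfare = social_welfare(allocation, weights)  # every buyer is satisfied
surplus = welfare - cost                       # the social-surplus objective
```

In this example the allocation is feasible for the allocate all problem because `welfare` equals the number of buyers; the allocate with budget problem would additionally compare `cost` against the budget.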


13.3  Structural  Results  and  Allocation  Policies 

We  now  present  several  results  about  the  structure  of  optimal  (and  non-optimal  but  “reasonable”) 
solutions  to  allocation  problems  in  the  setting  of  decreasing  marginal  costs.  These  will  be  impor¬ 
tant  in  our  sample-complexity  analysis  because  they  allow  us  to  focus  on  allocation  policies  that 
have inherent complexity that depends only on the number of items and not on the number of
customers, allowing for the use of uniform convergence bounds. That is, a small random sample
of  customers  will  be  sufficient  to  uniformly  estimate  the  performance  of  these  policies  over  the 
full  set  of  customers. 

13.3.1  Permutation  and  pricing  policies 

A permutation policy has a permutation $\pi$ over $\mathcal{I}$ and is applicable in the case of uniform unit
demand. Given buyer $j$ arriving, we allocate to him the minimal (first) demanded item in the
permutation, i.e., $\arg\min_{i \in S_j} \pi(i)$. A pricing policy assigns a price $\mathrm{price}_i$ to each item $i$ and is
applicable to general quasilinear utility functions. Given buyer $j$ arriving, we allocate to him
whatever he wishes to purchase at those prices, i.e., $\arg\max_{T_j} u_j(T_j) - \sum_{i \in T_j} \mathrm{price}_i$.2
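
Both kinds of policy are simple to state in code. The following Python sketch is an illustration under stated assumptions: the pricing policy is specialized to a single unit-demand buyer (so the purchased bundle is at most one item), ties are broken toward the lowest-indexed item, and a buyer with no strictly positive gain buys nothing. The function names are hypothetical.

```python
def permutation_policy(order, wanted):
    """Return the first item in `order` that appears in the buyer's list `wanted`.

    Returns None if no item in `order` is wanted.
    """
    for item in order:
        if item in wanted:
            return item
    return None


def pricing_policy(prices, weights):
    """Unit-demand purchase: the item maximizing weight - price, if that gain is positive.

    Assumed tie-breaking: lowest index wins; zero gain means no purchase.
    """
    best_item, best_gain = None, 0.0
    for item, price in enumerate(prices):
        gain = weights[item] - price
        if gain > best_gain:
            best_item, best_gain = item, gain
    return best_item


chosen_by_order = permutation_policy(order=[2, 0, 1], wanted={0, 1})
chosen_by_price = pricing_policy(prices=[0.5, 0.2, 0.9], weights=[1.0, 0.6, 1.0])
```

Here the permutation policy skips item 2 (not on the list) and hands out item 0, while the pricing policy compares the gains 0.5, 0.4 and 0.1 and also selects item 0. Either policy is described by a number of parameters that depends only on the number of items, which is the property the sample-complexity analysis relies on.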

We  will  see  below  that  for  uniform  unit  demand  buyers,  there  always  exists  a  permutation 
policy  that  is  optimal  for  the  allocate-all  task,  and  for  general  quasilinear  utilities  there  always 
exists  a  pricing  policy  that  is  optimal  for  the  task  of  maximizing  social  surplus.  We  will  also 
see  that  for  weighted  unit  demand  buyers,  there  always  exists  a  pricing  policy  that  is  optimal 
for the allocate-with-budget task; moreover, for any solution, even a non-optimal one (e.g., one that might
be produced by a polynomial-time algorithm), there exists a pricing policy that sells the same
number of copies of each item and has social welfare at least as high (and can be computed in
polynomial  time  given  the  initial  solution). 

13.3.2  Structural  results 

Theorem 13.1. For general quasilinear utilities, any allocation that maximizes social surplus
can be produced by a pricing policy. That is, if $T = \{T_1, \ldots, T_n\}$ is an allocation maximizing
$\mathrm{SW}(T) - \mathrm{cost}(T)$ then there exist prices $\mathrm{price}_1, \ldots, \mathrm{price}_r$ such that buyers purchasing their
most-demanded bundle recovers $T$, assuming that the marginal cost function is strictly decreasing.3

2When more than one subset is applicable, we assume we have the freedom to select any such set.
Proof. Consider the optimal allocation OPT. Define $\mathrm{price}_i$ to be the marginal cost of the next
copy of item $i$ under OPT, i.e., $\mathrm{price}_i = \mathrm{cost}_i(\#_i(\mathrm{OPT}) + 1) - \mathrm{cost}_i(\#_i(\mathrm{OPT}))$. Suppose some buyer $j$ is assigned
set $T_j$ in OPT but prefers set $T'_j$ under these prices. Then,

$u_j(T'_j) - \sum_{i \in T'_j} \mathrm{price}_i > u_j(T_j) - \sum_{i \in T_j} \mathrm{price}_i,$

which implies

$u_j(T'_j) - u_j(T_j) + \sum_{i \in T_j \setminus T'_j} \mathrm{price}_i - \sum_{i \in T'_j \setminus T_j} \mathrm{price}_i > 0.$  (13.1)

Now, consider modifying OPT by replacing $T_j$ with $T'_j$. This increases buyer $j$'s utility by
$u_j(T'_j) - u_j(T_j)$, incurs an extra purchase cost of exactly $\sum_{i \in T'_j \setminus T_j} \mathrm{price}_i$, and a savings of strictly
more than $\sum_{i \in T_j \setminus T'_j} \mathrm{price}_i$ (because marginal costs are decreasing). Thus, by (13.1) this would
be a strictly preferable allocation, contradicting the optimality of OPT. □
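
The argument can be checked numerically on a tiny instance. The sketch below is an illustration only, restricted to unit-demand buyers who receive at most one item, with an assumed strictly decreasing marginal cost $t^{-1/2}$ shared by all items. It finds a surplus-maximizing assignment by brute force, sets each price to the marginal cost of the next copy, and checks that no buyer strictly prefers a different option (including buying nothing) at those prices.

```python
import itertools
import random

ALPHA = 0.5  # assumed exponent; marginal cost of copy t is t^(-ALPHA)


def cost(t):
    """Cumulative cost of t copies of any one item."""
    return sum(tau ** -ALPHA for tau in range(1, t + 1))


def surplus(assign, weights, r):
    """Social surplus SW - cost of an assignment (None means no item)."""
    counts = [sum(1 for a in assign if a == i) for i in range(r)]
    welfare = sum(weights[j][a] for j, a in enumerate(assign) if a is not None)
    return welfare - sum(cost(k) for k in counts)


def check_pricing(weights, r, tol=1e-9):
    """True if next-copy marginal prices support a brute-force optimal assignment."""
    options = [None] + list(range(r))
    best = max(itertools.product(options, repeat=len(weights)),
               key=lambda a: surplus(a, weights, r))
    counts = [sum(1 for a in best if a == i) for i in range(r)]
    prices = [cost(k + 1) - cost(k) for k in counts]

    def gain(j, a):
        return 0.0 if a is None else weights[j][a] - prices[a]

    return all(gain(j, best[j]) >= gain(j, a) - tol
               for j in range(len(weights)) for a in options)


random.seed(0)
weights = [[random.uniform(0, 2) for _ in range(3)] for _ in range(4)]
ok = check_pricing(weights, r=3)
```

The check only rules out strict improvements; buyers may still be indifferent between options, which is where the freedom to break ties (footnote 2) enters.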

Corollary  13.2.  For  uniform  unit  demand  buyers  there  exists  an  optimal  allocation  that  is  a 
permutation  policy,  for  the  allocate  all  task. 


Proof. Imagine each buyer $j$ had valuation $v_{\max}$ on items in $S_j$, where $v_{\max}$ is greater than the
maximum cost of any single item. The allocation OPT that maximizes social surplus would
then minimize cost subject to allocating exactly one item to each buyer and therefore would
be optimal for the allocate-all task. Consider the pricing associated to this allocation given by
Theorem 13.1. Since each buyer $j$ is uniform unit demand, he will simply purchase the cheapest
item in $S_j$. Therefore, the permutation $\pi$ that orders items by increasing price, according
to the prices of Theorem 13.1, will produce the same allocation. □

We  now  present  a  structural  statement  that  will  be  useful  for  the  allocate-with-budget  task. 

3If  the  marginal  cost  function  is  only  non-increasing,  we  can  have  the  same  result,  assuming  we  can  select 
between  the  utility  maximizing  bundles. 
Theorem 13.3. For weighted unit-demand buyers, for any allocation $\mathcal{T}$ there exists a pricing
policy that allocates the same multiset of items $T$ (or a subset of $T$) and has social welfare at
least as large as $\mathcal{T}$. Moreover, this pricing can be computed efficiently from $\mathcal{T}$ and the buyers'
valuations.

Proof. Let $T$ be the multiset of items allocated by $\mathcal{T}$. Weighted unit-demand valuations satisfy
the gross-substitutes property, so by the Second Welfare Theorem (e.g., see [Nisan, Roughgarden,
Tardos, and Vazirani, 2007] Theorem 11.15) there exists a Walrasian equilibrium: a set of
prices for the items in $T$ that clears the market. Moreover, these prices can be computed efficiently
from demand queries (e.g., [Nisan, Roughgarden, Tardos, and Vazirani, 2007], Theorem
11.24), which can be evaluated efficiently for weighted unit-demand buyers. Furthermore, these
prices must assign all copies of the same item in $T$ the same price (else the pricing would not be
an equilibrium) so it corresponds to a legal pricing policy. Thus, we have a legal pricing such
that if all buyers were shown only the items represented in $T$, at these prices, then the market
would clear perfectly (breaking any ties in our favor). We can address the fact that there may be
items not represented in $T$ (i.e., they had zero copies sold) by simply setting their price to infinity.
Finally, by the First Welfare Theorem (e.g., [Nisan, Roughgarden, Tardos, and Vazirani, 2007]
Theorem 11.13), this pricing maximizes social welfare over all allocations of $T$, and therefore
achieves social welfare at least as large as $\mathcal{T}$, as desired. □

The above structural results will allow us to use the following sketch of an online algorithm.
First sample an initial set of $\ell$ buyers. Then, for the allocate-all problem, compute the best
(or approximately best) permutation policy according to the empirical frequencies given by the
sample. Or, for the allocate-with-budget task, compute the best (or approximately best) allocation
according to these empirical frequencies and convert it into a pricing policy. Then run this
permutation or pricing policy on the remainder of the customers. Finally, using the fact that
these policies have low complexity (they are lists or vectors in a space that depends only on the
number of items and not on the number of buyers) compute the size of initial sample needed to
ensure that the estimated performance is close to true performance uniformly over all policies in
the class.
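
For the allocate-all problem, the sketch above can be written out as follows. This Python code is a minimal illustration, not the analyzed algorithm: it assumes every buyer's list is nonempty, uses a hypothetical cost curve with marginal cost $t^{-1/2}$ for all items, finds the best permutation on the sample by exhaustive search (feasible only for a handful of items), and treats each sampled buyer as standing for an equal share of the remaining buyers by rounding scaled counts. How the initial buyers themselves are served is left out.

```python
import itertools
import random
from collections import Counter


def poly_cost(t, alpha=0.5):
    """Assumed cumulative cost curve: sum_{tau=1}^{t} tau^(-alpha)."""
    return sum(tau ** -alpha for tau in range(1, t + 1))


def first_wanted(order, wanted):
    """Permutation policy: earliest item in `order` that the buyer wants."""
    return next(item for item in order if item in wanted)


def policy_cost(order, lists, scale=1.0):
    """Cost of running `order` on `lists`, each buyer standing for `scale` buyers."""
    counts = Counter(first_wanted(order, wanted) for wanted in lists)
    return sum(poly_cost(round(k * scale)) for k in counts.values())


def learn_then_allocate(lists, num_items, sample_size):
    """Fit the best permutation on an initial sample, then apply it to the rest."""
    sample, rest = lists[:sample_size], lists[sample_size:]
    scale = len(rest) / max(len(sample), 1)
    order = min(itertools.permutations(range(num_items)),
                key=lambda p: policy_cost(p, sample, scale))
    return list(order), [first_wanted(order, wanted) for wanted in rest]


random.seed(1)
buyers = [set(random.sample(range(4), random.randint(1, 3))) for _ in range(200)]
order, allocation = learn_then_allocate(buyers, num_items=4, sample_size=40)
```

The learned object is just an ordering of the items, so its complexity does not grow with the number of buyers; replacing the exhaustive search with an approximation algorithm changes only the line that computes `order`.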


13.4 Uniform Unit Demand and the Allocate-All problem

Here  we  consider  the  allocate-all  problem  for  the  setting  of  uniform  unit  demand.  For  intuition, 
we  begin  by  considering  the  following  simple  class  of  decreasing  marginal  cost  curves. 
Definition 13.4. We say the cost function $\mathrm{cost}(t)$ is $\alpha$-poly if the marginal cost of item $t$ is $1/t^{\alpha}$
for $\alpha \in [0, 1)$. That is, $\mathrm{cost}(t) = \sum_{\tau=1}^{t} 1/\tau^{\alpha}$.
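
As a quick sanity check on the scale of an $\alpha$-poly cost, one can sandwich the sum between its smallest term times $t$ and an integral; the numerical instance ($\alpha = 1/2$, $t = 4$) is chosen here purely for illustration.

```latex
\[
  t^{1-\alpha} \;=\; t \cdot t^{-\alpha}
  \;\le\; \mathrm{cost}(t) \;=\; \sum_{\tau=1}^{t} \tau^{-\alpha}
  \;\le\; \int_{0}^{t} x^{-\alpha}\,dx \;=\; \frac{t^{1-\alpha}}{1-\alpha},
\]
\[
  \alpha = \tfrac{1}{2},\; t = 4:\qquad
  2 \;\le\; 1 + \tfrac{1}{\sqrt{2}} + \tfrac{1}{\sqrt{3}} + \tfrac{1}{2} \approx 2.784 \;\le\; 4 .
\]
```

So the cumulative cost of $t$ copies is within a factor $\frac{1}{1-\alpha}$ of $t^{1-\alpha}$, which is the same factor that appears in the approximation guarantee for this cost class.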

Theorem 13.5. If each cost function is $\alpha$-poly, then there exists an efficient offline algorithm that
given a set $X$ of buyers produces a permutation policy that incurs total cost at most $\frac{1}{1-\alpha}\mathrm{OPT}$.

Proof. We run the greedy set-cover algorithm. Specifically, we choose the item desired by the
most buyers and put it at the top of the permutation $\pi$. We then choose the item desired by
the most buyers who did not receive the first item and put it next, and so on. For notational
convenience assume $\pi$ is the identity, and let $S_i$ denote the set of buyers that receive item $i$. For
any set $S \subseteq X$, let $\mathrm{OPT}(S)$ denote the cost of the optimal solution to the subproblem $S$ (i.e., the
problem in which we are only required to cover buyers in $S$). Clearly
$\mathrm{OPT}(S_r) = \mathrm{cost}(|S_r|) = \sum_{\tau=1}^{|S_r|} 1/\tau^{\alpha} \ge |S_r|^{1-\alpha}$
(each of the $|S_r|$ terms is at least $1/|S_r|^{\alpha}$), since any solution using more than one set
to cover the elements of $S_r$ has at least as large a cost.

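Now, for the purpose of induction, suppose that some $k \in \{2, \ldots, r\}$ has
$\mathrm{OPT}(\bigcup_{t=k}^{r} S_t) \ge \sum_{t=k}^{r} |S_t|^{1-\alpha}$. Then, since $S_{k-1}$ was chosen to be the largest
subset of $\bigcup_{t=k-1}^{r} S_t$ that can be covered by a single item, it must be that the sets used by any
allocation for the $\bigcup_{t=k-1}^{r} S_t$ subproblem achieving $\mathrm{OPT}(\bigcup_{t=k-1}^{r} S_t)$ have size at most
$|S_{k-1}|$, and thus the marginal cost for each of the elements of $S_{k-1}$ in the
$\mathrm{OPT}(\bigcup_{t=k-1}^{r} S_t)$ solution is at least $1/|S_{k-1}|^{\alpha}$.

This implies
$$\mathrm{OPT}\Big(\bigcup_{t=k-1}^{r} S_t\Big) \ge \mathrm{OPT}\Big(\bigcup_{t=k}^{r} S_t\Big) + \sum_{x \in S_{k-1}} \frac{1}{|S_{k-1}|^{\alpha}} = \mathrm{OPT}\Big(\bigcup_{t=k}^{r} S_t\Big) + |S_{k-1}|^{1-\alpha}.$$
By the inductive hypothesis, this latter expression is at least as large as $\sum_{t=k-1}^{r} |S_t|^{1-\alpha}$.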




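By induction, this implies $\mathrm{OPT}(X) = \mathrm{OPT}(\bigcup_{t=1}^{r} S_t) \ge \sum_{t=1}^{r} |S_t|^{1-\alpha}$. On the other hand,
the total cost incurred by the greedy algorithm is
$\sum_{t=1}^{r} \sum_{\tau=1}^{|S_t|} 1/\tau^{\alpha} \le \sum_{t=1}^{r} \int_{0}^{|S_t|} x^{-\alpha}\,dx = \frac{1}{1-\alpha} \sum_{t=1}^{r} |S_t|^{1-\alpha}$.
By the above argument, this is at most $\frac{1}{1-\alpha}\mathrm{OPT}(X)$. □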


More general cost curves. We can generalize the above result to a natural class of smoothly
decreasing cost curves. Define the average cost of item $i$ given to a set $S_i$ of buyers as
$\mathrm{AvgC}(i, |S_i|) = \mathrm{cost}_i(|S_i|)/|S_i|$. Define the marginal cost
$\mathrm{MarC}(i, t) = \mathrm{cost}_i(t) - \mathrm{cost}_i(t-1)$. Here is a greedy algorithm.

Algorithm: GreedyGeneralCost($S$)

0.  $i = \arg\min_i \mathrm{AvgC}(i, |S_i|)$

1.  Call GreedyGeneralCost($S - S_i$)
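The recursion above can be sketched in Python. This is a minimal illustration rather than the thesis's implementation: it assumes that $S_i$ means the buyers in the current set $S$ who desire item $i$, that ties are broken by item name, and that items never selected are appended at the end of the permutation. The names `greedy_general_cost`, `policy_cost` and `alpha_poly` are hypothetical.

```python
from typing import Callable, Dict, List, Set


def alpha_poly(alpha: float) -> Callable[[int], float]:
    """alpha-poly cost curve: cost(t) = sum_{tau=1..t} 1/tau^alpha."""
    return lambda t: sum(tau ** -alpha for tau in range(1, t + 1))


def greedy_general_cost(
    desired: Dict[str, Set[str]],
    cost: Dict[str, Callable[[int], float]],
) -> List[str]:
    """Return a permutation policy (ordered list of items).

    desired: buyer -> set of items that buyer wants (uniform unit demand).
    cost:    item  -> total production cost as a function of copies made.
    """
    remaining = set(desired)
    order: List[str] = []
    while remaining:
        best_item, best_avg = None, float("inf")
        for item in sorted(cost):
            takers = sum(1 for b in remaining if item in desired[b])
            if takers == 0:
                continue
            avg = cost[item](takers) / takers  # AvgC(i, |S_i|)
            if avg < best_avg:
                best_item, best_avg = item, avg
        if best_item is None:
            raise ValueError("some buyer desires no item")
        order.append(best_item)
        # Recurse on S - S_i: drop every buyer served by the chosen item.
        remaining = {b for b in remaining if best_item not in desired[b]}
    # Assumption: unused items go last, in sorted order.
    order.extend(i for i in sorted(cost) if i not in order)
    return order


def policy_cost(order, desired, cost) -> float:
    """Cost of giving each buyer the first item in `order` that it desires."""
    counts = {i: 0 for i in cost}
    for wants in desired.values():
        counts[next(i for i in order if i in wants)] += 1
    return sum(cost[i](c) for i, c in counts.items())
```

For instance, with four buyers and three items that share an $\alpha$-poly curve with $\alpha = 0.5$, calling `greedy_general_cost` and then `policy_cost` gives the permutation and its total production cost.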

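We make the following assumption:

Assumption 13.6. $\mathrm{AvgC}(i, t) \le \beta\,\mathrm{MarC}(i, t)$, for some $\beta > 0$.

For example, for the case of an $\alpha$-poly cost, we have $\mathrm{MarC}(t) = \frac{1}{t^{\alpha}}$ and
$\mathrm{AvgC}(t) = \frac{1}{t}\sum_{\tau=1}^{t} \frac{1}{\tau^{\alpha}} \le \frac{t^{-\alpha}}{1-\alpha}$, so we have $\beta = \frac{1}{1-\alpha}$.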
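As a quick numerical sanity check of this example, the following sketch computes the largest ratio $\mathrm{AvgC}(t)/\mathrm{MarC}(t)$ over a range of $t$ for an $\alpha$-poly curve and compares it with $1/(1-\alpha)$. The function names and the range of $t$ are illustrative assumptions, not part of the original text.

```python
def marginal_cost(t: int, alpha: float) -> float:
    """MarC(t) = 1 / t^alpha for an alpha-poly cost."""
    return t ** -alpha


def average_cost(t: int, alpha: float) -> float:
    """AvgC(t) = cost(t) / t with cost(t) = sum_{tau=1..t} 1/tau^alpha."""
    return sum(tau ** -alpha for tau in range(1, t + 1)) / t


def worst_ratio(alpha: float, t_max: int) -> float:
    """Largest AvgC(t)/MarC(t) over t = 1..t_max."""
    return max(
        average_cost(t, alpha) / marginal_cost(t, alpha)
        for t in range(1, t_max + 1)
    )
```

Under these definitions, `worst_ratio(alpha, t_max)` should never exceed `1 / (1 - alpha)`, which is the value of $\beta$ claimed above.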

Theorem 13.7. The algorithm GreedyGeneralCost achieves approximation ratio $\beta$.


Proof. Order the elements in the order that GreedyGeneralCost allocates them. Let $N_j$ be the
set of consumers that receive item $j$, and $N = \bigcup_j N_j$, in GreedyGeneralCost. For consumer
$i$ let $\mathrm{item}_{opt}(i)$ be the item that OPT allocates to consumer $i$. Let $\ell_{opt}(j)$ be the number of
consumers that are allocated item $j$ in OPT. By Assumption 13.6 we have
$\mathrm{MarC}(j, \ell) \le \mathrm{AvgC}(j, \ell) \le \beta\,\mathrm{MarC}(j, \ell)$ (the first inequality is due to having decreasing marginal cost).
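We would like to consider the influence of the consumers in $N_1$ on the cost of OPT. We have

$$\mathrm{OPT}(N) - \mathrm{OPT}(N - N_1) \ge \sum_{i \in N_1} \mathrm{MarC}(\mathrm{item}_{opt}(i), \ell_{opt}(\mathrm{item}_{opt}(i)))$$
$$\ge \sum_{i \in N_1} \frac{1}{\beta}\,\mathrm{AvgC}(\mathrm{item}_{opt}(i), \ell_{opt}(\mathrm{item}_{opt}(i)))$$
$$\ge \frac{1}{\beta}\,|N_1|\,\mathrm{AvgC}(1, |N_1|) = \frac{1}{\beta}\,\mathrm{GreedyCost}(N_1).$$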


The first inequality follows since taking the final marginal cost can only reduce the cost (decreasing
marginal cost). The second inequality follows from Assumption 13.6. The third inequality
follows since GreedyGeneralCost selects the lowest average cost of any allocated item.

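We can now continue inductively. Let $T_0 = N$, $T_1 = N - N_1$, and $T_i = T_{i-1} - N_i$. We can
show similarly that

$$\mathrm{OPT}(T_{i-1}) - \mathrm{OPT}(T_i) \ge \frac{1}{\beta}\,\mathrm{GreedyCost}(N_i).$$

Summing over all $i$ we have

$$\mathrm{OPT}(T_0) - \mathrm{OPT}(\emptyset) = \sum_i \big(\mathrm{OPT}(T_{i-1}) - \mathrm{OPT}(T_i)\big) \ge \frac{1}{\beta} \sum_i \mathrm{GreedyCost}(N_i) = \frac{1}{\beta}\,\mathrm{GreedyCost}(N).$$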
□ 


Corollary 13.8. If the cost function is $\alpha$-poly, then for $\beta = \frac{1}{1-\alpha}$, Assumption 13.6 holds. Thus

$$\frac{\mathrm{GreedyCost}(S_j)}{\mathrm{OPTCost}(S_j)} \le \frac{1}{1-\alpha}.$$

Additionally, the following property is satisfied for these $\beta$-nice cost functions.

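Lemma 13.9. For cost satisfying Assumption 13.6, $\forall x \in \mathbb{N}$, $\forall \epsilon \in (0, 1)$, $\forall i \le r$,
$\mathrm{cost}_i(\epsilon x) \le \epsilon^{\log_2(1 + \frac{1}{2\beta})}\,\mathrm{cost}_i(x)$.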


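Proof. By the fact that marginal costs are non-negative, $\mathrm{AvgC}(2\epsilon x) \ge \mathrm{cost}_i(\epsilon x)/(2\epsilon x)$. Therefore,
by Assumption 13.6, $\mathrm{MarC}(2\epsilon x) \ge \mathrm{cost}_i(\epsilon x)/(2\epsilon x \beta)$. By the decreasing marginal cost
property, we have

$$\mathrm{cost}_i(2\epsilon x) \ge \mathrm{cost}_i(\epsilon x) + \epsilon x\,\mathrm{MarC}(2\epsilon x) \ge \mathrm{cost}_i(\epsilon x) + \frac{\mathrm{cost}_i(\epsilon x)}{2\beta} = \Big(1 + \frac{1}{2\beta}\Big)\mathrm{cost}_i(\epsilon x).$$

Applying this argument $\log_2(1/\epsilon)$ times, we have

$$\mathrm{cost}_i(x) \ge \Big(1 + \frac{1}{2\beta}\Big)^{\log_2(1/\epsilon)} \mathrm{cost}_i(\epsilon x) = \Big(\frac{1}{\epsilon}\Big)^{\log_2(1 + \frac{1}{2\beta})} \mathrm{cost}_i(\epsilon x).$$

Multiplying both sides by $\epsilon^{\log_2(1 + \frac{1}{2\beta})}$ completes the proof.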
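The lemma can be checked numerically on an $\alpha$-poly curve, for which $\beta = 1/(1-\alpha)$. The sketch below is only an illustration under that assumption: it restricts $\epsilon$ to powers of $1/2$ and $x$ to multiples of $2^k$, so that $\epsilon x$ is an integer and the doubling argument of the proof applies directly. The helper names are hypothetical.

```python
import math


def alpha_poly_cost(t: int, alpha: float) -> float:
    """cost(t) = sum_{tau=1..t} 1/tau^alpha."""
    return sum(tau ** -alpha for tau in range(1, t + 1))


def lemma_13_9_holds(x: int, k: int, alpha: float) -> bool:
    """Check cost(eps*x) <= eps^c * cost(x) for eps = 2^-k.

    Requires x to be divisible by 2^k; c = log2(1 + 1/(2*beta)) with
    beta = 1/(1 - alpha), the value derived for alpha-poly costs.
    """
    if x % (2 ** k) != 0:
        raise ValueError("x must be divisible by 2**k")
    beta = 1.0 / (1.0 - alpha)
    c = math.log2(1.0 + 1.0 / (2.0 * beta))
    eps = 2.0 ** -k
    lhs = alpha_poly_cost(x // (2 ** k), alpha)
    rhs = (eps ** c) * alpha_poly_cost(x, alpha)
    return lhs <= rhs + 1e-12
```

A call such as `lemma_13_9_holds(64, 3, 0.5)` compares $\mathrm{cost}(8)$ against $(1/8)^{c}\,\mathrm{cost}(64)$.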


13.4.1  Generalization Result

Say $n$ is the total number of customers; $\ell$ is the size of the subsample on which we estimate;
$r$ is the total number of items; $\alpha \in [0, 1)$ is some constant, and the cost is $\alpha$-poly, so that
$\mathrm{cost}(t) = \sum_{\tau=1}^{t} 1/\tau^{\alpha} \approx \int_0^t y^{-\alpha}\,dy = \frac{t^{1-\alpha}}{1-\alpha}$.
We have the following generalization result:

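Theorem 13.10. Suppose $n \ge \ell$ and the cost function is $\alpha$-poly. With probability at least $1 - \delta^{(\ell)}$,
for any permutation $\Pi$,

$$\mathrm{cost}(\Pi, \ell)(1+\epsilon)^{-2} \le \mathrm{cost}(\Pi, n) \le \mathrm{cost}(\Pi, \ell)(1+\epsilon)^{2(1-\alpha)},$$

where $\delta^{(\ell)} = r2^r(\delta_1 + \delta_2 + \delta_3)$ and $\delta_1 = \exp\{-\epsilon^2 (\epsilon/r)^{\frac{1}{1-\alpha}} n/3\}$,
$\delta_2 = \exp\{-\epsilon^2 \ell\,(\epsilon/r)^{\frac{1}{1-\alpha}}/3\}$, $\delta_3 = \exp\{-(\epsilon/r)^{\frac{1}{1-\alpha}} n \epsilon^2/2\}$.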

Proof. Fix a permutation $\Pi$. Let $\pi_j$ denote the event that a customer buys item $\Pi_j$ and is not
covered by items $\Pi_1$ through $\Pi_{j-1}$; namely, the event that the customer's set of desired
items includes $j$ and none of the items $1, \ldots, j-1$. Let $q_j$ denote $\Pr[\pi_j]$, and let $\hat{q}_j$ denote the
fraction of $\pi_j$ on the initial $\ell$-sample.

Item $j$ is a "Low probability item" if $q_j < (\epsilon/r)^{\frac{1}{1-\alpha}}$, and a "High probability item" if
$q_j \ge (\epsilon/r)^{\frac{1}{1-\alpha}}$. Let the set "Low" include all "Low probability items", and the set "High" include all
"High probability items".

First we address the case of item $j$ of low probability. The quantity of item $j$ that we
will sell is at most $(\epsilon/r)^{\frac{1}{1-\alpha}} n(1+\epsilon)$ (Chernoff bound) with probability at least $1 - \delta_1$, with
$\delta_1 = \exp\{-\epsilon^2 (\epsilon/r)^{\frac{1}{1-\alpha}} n/3\}$. By a union bound, this holds for all low probability items $j$, with
probability at least $1 - |\mathrm{Low}|\delta_1$.

Next, we suppose $j$ has high probability. In this case, the quantity of item $j$ we will sell is at
most $q_j n(1+\epsilon)$, with probability at least $1 - \exp\{-\epsilon^2 q_j n/3\} \ge 1 - \delta_1$. Again, a union bound
implies this holds for all high probability $j$ with probability at least $1 - |\mathrm{High}|\delta_1$.

We have that (by Chernoff bounds), with probability at least $1 - \exp\{-\epsilon^2 \ell q_j/3\} \ge 1 - \delta_2$, we
have $q_j/\hat{q}_j \le (1+\epsilon)$. A union bound implies this holds for all high probability $j$ with probability
$1 - r\delta_2$.

Furthermore, noting that $q_j n(1+\epsilon) = \hat{q}_j n(1+\epsilon)\frac{q_j}{\hat{q}_j}$, and upper bounding $\frac{q_j}{\hat{q}_j}$ by $1+\epsilon$, we get
that $q_j n(1+\epsilon) \le (1+\epsilon)^2 \hat{q}_j n$, with probability $1 - \delta_2$. Thus,


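$$\mathrm{cost}(\Pi, n) \le \mathrm{cost}(\mathrm{Low}) + \mathrm{cost}(\mathrm{High})$$
$$\le r\Big((\epsilon/r)^{\frac{1}{1-\alpha}} n(1+\epsilon)\Big)^{1-\alpha} + \sum_{j \in \mathrm{High}} \big((1+\epsilon)^2 \hat{q}_j n\big)^{1-\alpha}$$
$$\le \epsilon(1+\epsilon)^{1-\alpha} n^{1-\alpha} + (1+\epsilon)^{2(1-\alpha)} n^{1-\alpha} \sum_{j \in \mathrm{High}} (\hat{q}_j)^{1-\alpha}.$$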

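Note that the total cost of all low probability items is at most an $\epsilon$-fraction of OPT, which is at least
$\frac{n^{1-\alpha}}{1-\alpha}$. Also,

$$(1+\epsilon)^{2(1-\alpha)} n^{1-\alpha} \sum_{j \in \mathrm{High}} (\hat{q}_j)^{1-\alpha} \le (1+\epsilon)^{2(1-\alpha)} \Big(\frac{n}{\ell}\Big)^{1-\alpha} \mathrm{cost}(\Pi, \ell)$$

by definition of $\mathrm{cost}(\Pi, \ell)$.

Therefore we showed

$$\mathrm{cost}(\Pi, n) \le \epsilon(1+\epsilon)^{1-\alpha} \ell^{1-\alpha} \Big(\frac{n}{\ell}\Big)^{1-\alpha} + (1+\epsilon)^{2(1-\alpha)} \Big(\frac{n}{\ell}\Big)^{1-\alpha} \mathrm{cost}(\Pi, \ell)$$
$$\le (1+5\epsilon) \Big(\frac{n}{\ell}\Big)^{1-\alpha} \mathrm{cost}(\Pi, \ell).$$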

The lower bound is basically similar. For $j \in \mathrm{Low}$, we have $q_j < (\epsilon/r)^{\frac{1}{1-\alpha}}$ and
$\hat{q}_j \le (\epsilon/r)^{\frac{1}{1-\alpha}}(1+\epsilon)$ (by Chernoff bounds). So we have

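$$\sum_{j \in \mathrm{Low}} (\hat{q}_j \ell)^{1-\alpha} \le r\,\frac{\epsilon}{r}(1+\epsilon)^{1-\alpha} \ell^{1-\alpha}$$
$$= \epsilon(1+\epsilon)^{1-\alpha} n^{1-\alpha} \Big(\frac{\ell}{n}\Big)^{1-\alpha}$$
$$\le \epsilon(1+\epsilon)^{1-\alpha}\,\mathrm{cost}(\Pi, n).$$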

Thus, 


cost(n,£)  = 


< 


< 


< 


£  («i01-“+  £  «/)'■" 

jgLow  jeHigh 

cost(n,n)e  (l  +  e)1~Q+  ^  (g/n)1”" 

W  ieHigh 

cost(II.  n)e  (1  +  e)  +  ^  ( qjn )1_a  ^ 


/eHigh 


(1  +  e)2cost(n.  n) 


1—a 

(1  +  e) 


with probability at least $1 - \exp\{-q_j n \epsilon^2/2\} \ge 1 - \delta_3$. For low-probability $j$, the number of
item $j$ sold is $\ge (\epsilon/r)^{\frac{1}{1-\alpha}} n(1-\epsilon)$ with probability at least $1 - \delta_3$. A union bound extends these
to all $j$ with combined probability $1 - r\delta_3$.

Thus we obtain the upper bound $\mathrm{cost}(\Pi, n) \le \mathrm{cost}(\Pi, \ell)(1+\epsilon)^{2(1-\alpha)} (n/\ell)^{1-\alpha}$ and the lower
bound $\mathrm{cost}(\Pi, n) \ge \mathrm{cost}(\Pi, \ell)(1+\epsilon)^{-2} (n/\ell)^{1-\alpha}$, with probability at least $1 - r2^r(\delta_1 + \delta_2 + \delta_3)$.

A naive union bound can be done over all the permutations, which will add a factor of $r!$;
we can reduce the factor to $r2^r$ by noticing that we are only interested in events of the type $\pi_j$,
namely a given item (say, $j$) is in the set of desired items, and another set (say, $\{1, \ldots, j-1\}$) is
not in that set. There are only $r2^r$ different events over which we need to perform the union. □

13.4.2  Generalized Performance Guarantees


We define GreedyGeneralCost($\ell, n$) as follows. For the first $\ell$ customers it allocates arbitrary
items they desire, and observes their desired sets. Given the sets of the first $\ell$ customers,
it runs GreedyGeneralCost and computes a permutation $\hat{\Pi}$ of the items. For the remaining
customers it allocates using permutation $\hat{\Pi}$. Namely, each customer is allocated the first item in
the permutation $\hat{\Pi}$ that is in its desired set. The following theorem bounds the performance of
GreedyGeneralCost($\ell, n$) for $\alpha$-poly cost functions.
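The two-phase procedure just described can be sketched as follows. This is an illustrative sketch, not the thesis's code: it assumes all items share one $\alpha$-poly cost curve, so the greedy rule reduces to the set-cover rule from the proof of Theorem 13.5 (pick the item wanted by the most uncovered sampled buyers), it takes "arbitrary desired item" to mean the lowest-numbered one, and it assumes every buyer desires at least one item. All function names are hypothetical.

```python
from typing import List, Sequence, Set


def greedy_permutation(sample: Sequence[Set[int]], r: int) -> List[int]:
    """Greedy set-cover order over items 0..r-1 learned from the sample."""
    uncovered = list(sample)
    left = set(range(r))
    order: List[int] = []
    while left:
        # Item desired by the most still-uncovered sampled buyers
        # (ties go to the smallest item index).
        best = max(sorted(left), key=lambda i: sum(i in s for s in uncovered))
        order.append(best)
        left.remove(best)
        uncovered = [s for s in uncovered if best not in s]
    return order


def online_allocate(buyers: Sequence[Set[int]], r: int, ell: int) -> List[int]:
    """Return the item given to each buyer, in arrival order."""
    # Phase 1: the first ell buyers get an arbitrary desired item.
    allocation = [min(s) for s in buyers[:ell]]
    # Phase 2: learn a permutation from their desired sets, then commit to it.
    order = greedy_permutation(buyers[:ell], r)
    for s in buyers[ell:]:
        allocation.append(next(i for i in order if i in s))
    return allocation


def total_cost(allocation: Sequence[int], r: int, alpha: float) -> float:
    """Sum of alpha-poly production costs over all items."""
    counts = [list(allocation).count(i) for i in range(r)]
    return sum(sum(tau ** -alpha for tau in range(1, c + 1)) for c in counts)
```

The split point `ell` plays the role of $\ell$ in the definition above; the theorems that follow describe how large it needs to be.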

Theorem 13.11. With probability $1 - \delta^{(\ell)}$ (for $\delta^{(\ell)}$ as in Theorem 13.10), the cost of
GreedyGeneralCost($\ell, n$) is at most

$$\ell + \frac{(1+\epsilon)^{4-2\alpha}}{1-\alpha}\,\mathrm{OPT}.$$


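Proof. Let $\hat{\Pi}$ be the permutation policy produced by GreedyGeneralCost, after the first $\ell$
customers. By Theorem 13.7,

$$\mathrm{cost}(\hat{\Pi}, \ell) \le \frac{1}{1-\alpha} \min_{\Pi} \mathrm{cost}(\Pi, \ell).$$

By Theorem 13.10, with probability $1 - \delta^{(\ell)}$,

$$\min_{\Pi} \mathrm{cost}(\Pi, \ell) \le \min_{\Pi} \mathrm{cost}(\Pi, n)(1+\epsilon)^2 \Big(\frac{\ell}{n}\Big)^{1-\alpha}.$$

Additionally, on this same event,

$$\mathrm{cost}(\hat{\Pi}, n) \le \mathrm{cost}(\hat{\Pi}, \ell)(1+\epsilon)^{2(1-\alpha)} \Big(\frac{n}{\ell}\Big)^{1-\alpha}.$$

Altogether, this implies

$$\mathrm{cost}(\hat{\Pi}, n) \le \frac{(1+\epsilon)^{2(1-\alpha)}}{1-\alpha} \Big(\frac{n}{\ell}\Big)^{1-\alpha} \min_{\Pi} \mathrm{cost}(\Pi, n)(1+\epsilon)^2 \Big(\frac{\ell}{n}\Big)^{1-\alpha} = \frac{(1+\epsilon)^{4-2\alpha}}{1-\alpha} \min_{\Pi} \mathrm{cost}(\Pi, n).$$

□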

Corollary 13.12. For any fixed constant $\delta \in (0, 1)$, for any


3,2' 


e2  ve 


5 


and 


n  >  (  - 
e 


with probability at least $1 - \delta$ we have that the cost of GreedyGeneralCost($n, \ell$) is at most

$$\Big(\frac{(1+\epsilon)^{4-2\alpha}}{1-\alpha} + \epsilon\Big)\mathrm{OPT}.$$


13.4.3  Generalization for $\beta$-nice costs

Toward extending the offline-model results under Assumption 13.6 to the online setting, consider
the following lemma.

Lemma 13.13. For any cost cost satisfying Assumption 13.6 with a given $\beta$, for any $k \ge 1$, the
cost cost$'$ with $\mathrm{cost}'_i(x) = \mathrm{cost}_i(kx)$ also satisfies Assumption 13.6 with the same $\beta$.


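Proof.

$$\frac{\mathrm{cost}_i(kx)}{x} = k\,\frac{\mathrm{cost}_i(kx)}{kx} \le \beta k\,(\mathrm{cost}_i(kx) - \mathrm{cost}_i(kx-1)).$$

Also, the property of nonincreasing marginal costs implies $\forall t \in \{1, \ldots, k\}$,

$$\mathrm{cost}_i(kx) - \mathrm{cost}_i(kx-1) \le \mathrm{cost}_i(kx-(t-1)) - \mathrm{cost}_i(kx-t),$$

so that

$$k\,(\mathrm{cost}_i(kx) - \mathrm{cost}_i(kx-1)) \le \sum_{t=1}^{k} \big(\mathrm{cost}_i(kx-(t-1)) - \mathrm{cost}_i(kx-t)\big) = \mathrm{cost}_i(kx) - \mathrm{cost}_i(k(x-1)).$$

Therefore,

$$\frac{\mathrm{cost}_i(kx)}{x} \le \beta\,(\mathrm{cost}_i(kx) - \mathrm{cost}_i(k(x-1))).$$

□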

Now the strategy is to run GreedyGeneralCost with the rescaled cost function
$\mathrm{cost}'_i(x) = \mathrm{cost}_i(\frac{n}{\ell}x)$. This provides a $\beta$-approximation guarantee for the rescaled problem. The following
theorem describes the generalization capabilities of this strategy.

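Theorem 13.14. Suppose $n \ge \ell$ and the cost function satisfies Assumption 13.6, and that $\forall i$,
$\mathrm{cost}_i(1) \in [1, B]$, where $B \ge 1$ is constant. Let $\mathrm{cost}'_i(x) = \mathrm{cost}_i(\frac{n}{\ell}x)$. With probability at least
$1 - \delta^{(\ell)}$, for any permutation $\Pi$,

$$\mathrm{cost}'(\Pi, \ell)\,\frac{1-\epsilon}{1+2\epsilon-\epsilon^2} \le \mathrm{cost}(\Pi, n) \le \mathrm{cost}'(\Pi, \ell)\,\frac{(1+\epsilon)^2}{1-\epsilon},$$

where $\delta^{(\ell)} = r^2 2^{r+1}(\delta_1 + \delta_2)$, $\delta_1 = \exp\{-\epsilon^3 n^{\log_2(1+\frac{1}{2\beta})}/(3rB(1+\epsilon))\}$, and
$\delta_2 = \exp\{-\epsilon^2 \ell\,\frac{\epsilon}{rB(1+\epsilon)}\,n^{\log_2(1+\frac{1}{2\beta})-1}/3\}$. It is not necessary for the set of $\ell$ customers to be
contained in the set of $n$ customers for this.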

Proof. Fix a permutation $\Pi$. Let $\pi_j$ denote the event that a customer buys item $\Pi_j$ and is not
covered by items $\Pi_1$ through $\Pi_{j-1}$; namely, the event that the customer's set of desired
items includes $j$ and none of the items $1, \ldots, j-1$. Let $q_j$ denote $\Pr[\pi_j]$, and let $\hat{q}_j$ denote the
fraction of $\pi_j$ on the initial $\ell$-sample.

Let $q^* = \frac{\epsilon}{rB(1+\epsilon)} n^{c-1}$, where $c = \log_2(1 + \frac{1}{2\beta})$. Item $j$ is a "Low probability item" if $q_j < q^*$,
and is called a "High probability item" if $q_j \ge q^*$. Let the set "Low" include all "Low probability
items", and the set "High" include all "High probability items".

First we address the case of item $j$ of low probability. By a Chernoff bound, the quantity of
item $j$ that we will sell when applying $\Pi$ to $n$ customers is at most $q^* n(1+\epsilon)$, with probability
at least $1 - \exp\{-\epsilon^2 q^* n/3\} = 1 - \delta_1$. By a union bound, this holds for all low probability items
$j$ with probability at least $1 - |\mathrm{Low}|\delta_1$.

Next, suppose $j$ has high probability. In this case, the quantity of item $j$ we will sell when
applying $\Pi$ to $n$ customers is at most $q_j n(1+\epsilon)$, with probability at least $1 - \exp\{-\epsilon^2 q_j n/3\} \ge
1 - \delta_1$. Again, a union bound implies this holds for all high probability $j$ with probability at least
$1 - |\mathrm{High}|\delta_1$.

We have that (by Chernoff bounds), with probability at least $1 - \exp\{-\epsilon^2 \ell q_j/3\} \ge 1 - \delta_2$, we
have $q_j/\hat{q}_j \le (1+\epsilon)$. A union bound implies this holds for all high probability $j$ with probability
$1 - r\delta_2$.

Furthermore, noting that $q_j n(1+\epsilon) = \hat{q}_j n(1+\epsilon)\frac{q_j}{\hat{q}_j}$, and upper bounding $\frac{q_j}{\hat{q}_j}$ by $1+\epsilon$, we get
that $q_j n(1+\epsilon) \le (1+\epsilon)^2 \hat{q}_j n$, with probability at least $1 - \delta_2$. Thus, with probability at least
$1 - r\delta_1 - r\delta_2$,


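$$\mathrm{cost}(\Pi, n) \le \mathrm{cost}(\mathrm{Low}) + \mathrm{cost}(\mathrm{High})$$
$$\le \sum_{j \in \mathrm{Low}} \mathrm{cost}_j(q^* n(1+\epsilon)) + \sum_{j \in \mathrm{High}} \mathrm{cost}_j((1+\epsilon)^2 \hat{q}_j n)$$
$$\le rBq^* n(1+\epsilon) + (1+\epsilon)^2 \sum_{j \in \mathrm{High}} \mathrm{cost}_j(\hat{q}_j n)$$
$$= rBq^* n(1+\epsilon) + (1+\epsilon)^2 \sum_{j \in \mathrm{High}} \mathrm{cost}'_j(\Pi, \ell).$$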

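Note that Lemma 13.9 (with $\epsilon = 1/x$) implies that on $n$ customers,
$\mathrm{OPT} \ge \min_j \mathrm{cost}_j(n) \ge n^{\log_2(1+\frac{1}{2\beta})} \min_j \mathrm{cost}_j(1) \ge n^{\log_2(1+\frac{1}{2\beta})} = n^c$, where the third inequality is by the
assumption on the range of $\mathrm{cost}_j(1)$. Thus, $rBq^* n(1+\epsilon) = \epsilon n^c \le \epsilon\,\mathrm{OPT}$.

We showed that

$$\mathrm{cost}(\Pi, n) \le \epsilon\,\mathrm{OPT} + (1+\epsilon)^2 \sum_{j \in \mathrm{High}} \mathrm{cost}'_j(\Pi, \ell) \le \epsilon\,\mathrm{cost}(\Pi, n) + (1+\epsilon)^2 \sum_{j \in \mathrm{High}} \mathrm{cost}'_j(\Pi, \ell).$$

Therefore,

$$\mathrm{cost}(\Pi, n) \le \frac{(1+\epsilon)^2}{1-\epsilon} \sum_{j \in \mathrm{High}} \mathrm{cost}'_j(\Pi, \ell) \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathrm{cost}'(\Pi, \ell).$$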


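The lower bound is basically similar. For $j \in \mathrm{Low}$, a Chernoff bound implies we have
$\hat{q}_j \le q^*(1+\epsilon)$ with probability at least $1 - \exp\{-\epsilon^2 q^* \ell/3\} \ge 1 - \delta_2$. So we have

$$\sum_{j \in \mathrm{Low}} \mathrm{cost}_j(\hat{q}_j n) \le \sum_{j \in \mathrm{Low}} \mathrm{cost}_j(q^*(1+\epsilon)n) \le rB(1+\epsilon)q^* n = \epsilon n^c \le \epsilon\,\mathrm{OPT} \le \epsilon\,\mathrm{cost}(\Pi, n).$$

For $j \in \mathrm{High}$, again by a Chernoff bound, we have $\hat{q}_j/q_j \le (1+\epsilon)$ with probability at least
$1 - \exp\{-\epsilon^2 q_j \ell/3\} \ge 1 - \delta_2$. Thus, by a union bound, with probability at least $1 - r\delta_2$,

$$\mathrm{cost}'(\Pi, \ell) = \sum_{j \in \mathrm{Low}} \mathrm{cost}_j(\hat{q}_j n) + \sum_{j \in \mathrm{High}} \mathrm{cost}_j(\hat{q}_j n) \le \epsilon\,\mathrm{cost}(\Pi, n) + \sum_{j \in \mathrm{High}} \mathrm{cost}_j(q_j n(1+\epsilon)).$$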

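By another application of Chernoff and union bounds, with probability at least
$1 - \sum_{j \in \mathrm{High}} \exp\{-\epsilon^2 q_j n/2\} \ge 1 - r\delta_1$, for every $j \in \mathrm{High}$, the number of $j$ we will sell when
applying $\Pi$ to $n$ customers is at least $q_j n(1-\epsilon)$. Thus,

$$\sum_{j \in \mathrm{High}} \mathrm{cost}_j(q_j n(1+\epsilon)) = \sum_{j \in \mathrm{High}} \mathrm{cost}_j\Big(q_j n(1-\epsilon)\frac{1+\epsilon}{1-\epsilon}\Big) \le \frac{1+\epsilon}{1-\epsilon} \sum_{j \in \mathrm{High}} \mathrm{cost}_j(q_j n(1-\epsilon)) \le \frac{1+\epsilon}{1-\epsilon}\,\mathrm{cost}(\Pi, n).$$

Altogether, we have proven that with probability at least $1 - r(\delta_1 + \delta_2)$,

$$\mathrm{cost}'(\Pi, \ell) \le \Big(\epsilon + \frac{1+\epsilon}{1-\epsilon}\Big)\mathrm{cost}(\Pi, n) = \frac{1+2\epsilon-\epsilon^2}{1-\epsilon}\,\mathrm{cost}(\Pi, n),$$

which implies

$$\frac{1-\epsilon}{1+2\epsilon-\epsilon^2}\,\mathrm{cost}'(\Pi, \ell) \le \mathrm{cost}(\Pi, n).$$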

A naive union bound can be done over all the permutations, which will add a factor of $r!$;
we can reduce the factor to $r2^r$ by noticing that we are only interested in events of the type $\pi_j$,
namely a given item (say, $j$) is in the set of desired items, and another set (say, $\{1, \ldots, j-1\}$) is
not in that set. There are only $r2^r$ different events over which we need to perform the union. Thus, the
above inequalities hold for all permutations with probability at least $1 - r^2 2^{r+1}(\delta_1 + \delta_2)$. □


1 

Let  n0  =  0,  rii  =  2  ^  3rBB+c'>  ]n  ^4r22r+2jj  i°«2(i+^)  _  por  integer  i>2,  define 

2>rB(l  +  e)  In 

We define GreedyGeneralCost$_\beta(n)$ as follows. Allocate arbitrary (valid) items to the first
$n_1$ customers. For each $i \ge 2$ with $\sum_{j=1}^{i} n_j \le n$, run GreedyGeneralCost($S$) with cost function
cost$'$, where $S$ is the set of buyers $1, 2, \ldots, \sum_{j=1}^{i-1} n_j$ and $\forall j$, $\mathrm{cost}'_j(x) = \mathrm{cost}_j(x n_i / \sum_{t=1}^{i-1} n_t)$;
this produces a permutation policy $\hat{\Pi}$. We then allocate to the customers
$(\sum_{j=1}^{i-1} n_j) + 1, \ldots, \sum_{j=1}^{i} n_j$ using the permutation policy $\hat{\Pi}$.

The following theorem bounds the performance of GreedyGeneralCost$_\beta(n)$.

Theorem 13.15. If cost satisfies Assumption 13.6, and has $\mathrm{cost}_j(1) \in [1, B]$ for every $j \le r$,
with probability at least $1 - \delta$, the cost of GreedyGeneralCost$_\beta(n)$ is at most


^  1— l°g2(1+  2)5  ) 


$$Bn_1 + \beta\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2} \sum_{i:\ \sum_{j=1}^{i} n_j \le n} \mathrm{OPT}(n_i).$$


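Proof. By Theorem 13.7, Lemma 13.13, and Theorem 13.14 and a union bound, with probability
at least $1 - \delta$, for every $i$, the cost of GreedyGeneralCost$_\beta$ on customers
$1 + \sum_{j=1}^{i-1} n_j, \ldots, \sum_{j=1}^{i} n_j$ is at most

$$\mathrm{cost}'\Big(\hat{\Pi}, \sum_{j=1}^{i-1} n_j\Big)\frac{(1+\epsilon)^2}{1-\epsilon} \le \beta \min_{\Pi} \mathrm{cost}'\Big(\Pi, \sum_{j=1}^{i-1} n_j\Big)\frac{(1+\epsilon)^2}{1-\epsilon}$$
$$\le \beta \min_{\Pi} \mathrm{cost}(\Pi, n_i)\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}$$
$$= \beta\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}\,\mathrm{OPT}(n_i).$$

Summing over $i$ yields the result. □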

If we are allowed to preview the utilities of some initial $o(n)$ set of buyers, then we can get
the following simpler result.

Theorem 13.16. If cost satisfies Assumption 13.6, and has $\mathrm{cost}_j(1) \in [1, B]$ for every $j \le r$, with
probability at least $1 - \delta$, the cost of applying the policy found by GreedyGeneralCost($\{1, \ldots, \ell\}$)
to all $n$ customers is at most


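$$\beta\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}\,\mathrm{OPT}(n),$$

where $\ell = n^{1-\log_2(1+\frac{1}{2\beta})}\,\frac{3rB(1+\epsilon)}{\epsilon^3} \ln\Big(\frac{r^2 2^{r+2}}{\delta}\Big) = o(n)$.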


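Proof. By Theorem 13.7, Lemma 13.13, and Theorem 13.14, with probability at least $1 - \delta$, the
cost of applying the policy $\hat{\Pi}$ found by GreedyGeneralCost($\{1, \ldots, \ell\}$) to customers $1, \ldots, n$
is at most

$$\mathrm{cost}'(\hat{\Pi}, \ell)\,\frac{(1+\epsilon)^2}{1-\epsilon} \le \beta \min_{\Pi} \mathrm{cost}'(\Pi, \ell)\,\frac{(1+\epsilon)^2}{1-\epsilon}$$
$$\le \beta \min_{\Pi} \mathrm{cost}(\Pi, n)\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}$$
$$= \beta\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}\,\mathrm{OPT}(n).$$

□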


Also consider the following lemma.

Lemma 13.17. If cost satisfies Assumption 13.6, then for any $n \in \mathbb{N}$, $\mathrm{OPT}(2n) \ge \big(1 + \frac{1}{2\beta}\big)\mathrm{OPT}(n)$.

Proof. □

We define GreedyGeneralCost$'_\beta(n)$ as follows. Allocate an arbitrary (valid) item to the first
customer. For each $i \ge 1$ with $i \le \log_2(n)$, run GreedyGeneralCost($S$), where $S$ is the set of
buyers $1, 2, \ldots, 2^{i-1}$; this produces a permutation policy $\hat{\Pi}$. We then allocate to the customers
$2^{i-1} + 1, \ldots, 2^i$ using the permutation policy $\hat{\Pi}$.

The following theorem bounds the performance of GreedyGeneralCost$'_\beta(n)$.
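The doubling schedule in this definition can be made concrete with a short sketch. It only enumerates which prefix of buyers is used to learn the policy in each round and which block of buyers that policy then serves; truncating the last block at $n$ when $n$ is not a power of 2 is an assumption made here for illustration, as is the function name.

```python
from typing import List, Tuple


def doubling_epochs(n: int) -> List[Tuple[int, int, int]]:
    """Return (sample_end, first, last) triples, 1-indexed and inclusive.

    In each round the policy is learned from buyers 1..sample_end and then
    applied to buyers first..last. Buyer 1 is served arbitrarily beforehand.
    """
    epochs = []
    i = 1
    while 2 ** (i - 1) < n:
        sample_end = 2 ** (i - 1)
        # Assumption: cap the final block at n if n is not a power of 2.
        epochs.append((sample_end, sample_end + 1, min(2 ** i, n)))
        i += 1
    return epochs
```

Each triple would be passed to a learner such as GreedyGeneralCost on the prefix, followed by allocation on the block; the blocks partition buyers $2, \ldots, n$.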

Theorem 13.18. If cost satisfies Assumption 13.6, and has $\mathrm{cost}_j(1) \in [1, B]$ for every $j \le r$, letting
$\ell$ denote the smallest power of 2 greater than ^3' 11 r> In ^ lr2'f ' 2 j j lo“:2n + 27?\ with probability

.  ,  .  |  (i-log2(0)l°g2(1+ A)

at least r22r+2 (^2^+2) , the cost of GreedyGeneralCost$'_\beta(n)$


is at most

$$B\ell + \frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}\,(2\beta)^2\,\mathrm{OPT}(n).$$

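Proof. By Theorem 13.7, Theorem 13.14 and a union bound, with the stated probability, for
every $i > \log_2(\ell)$, the cost of GreedyGeneralCost$'_\beta$ on customers $2^{i-1} + 1, \ldots, 2^i$ is at most

$$\mathrm{cost}(\hat{\Pi}, 2^{i-1})\,\frac{(1+\epsilon)^2}{1-\epsilon} \le \beta \min_{\Pi} \mathrm{cost}(\Pi, \{1, \ldots, 2^{i-1}\})\,\frac{(1+\epsilon)^2}{1-\epsilon}$$
$$\le \beta \min_{\Pi} \mathrm{cost}(\Pi, \{2^{i-1} + 1, \ldots, 2^i\})\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}$$
$$= \beta\,\frac{(1+\epsilon)^2(1+2\epsilon-\epsilon^2)}{(1-\epsilon)^2}\,\mathrm{OPT}(\{2^{i-1} + 1, \ldots, 2^i\}).$$

By Lemma 13.17,

$$\mathrm{OPT}(\{2^{i-1} + 1, \ldots, 2^i\}) = \mathrm{OPT}(2^{i-1}) \le \mathrm{OPT}(2n \cdot 2^{i-1-\lceil \log_2(n) \rceil}) \le \Big(\frac{1}{1 + \frac{1}{2\beta}}\Big)^{\lceil \log_2(n) \rceil + 1 - i}\,2\,\mathrm{OPT}(n).$$

Summing this over $i \in \{\log_2(\ell) + 1, \ldots, \lceil \log_2(n) \rceil\}$ is at most $4\beta\,\mathrm{OPT}(n)$. Plugging this into
the above bound on the cost supplies the stated result. □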


13.5  General  Unit  Demand  Utilities 

In this section we show how to give a constant approximation for the case of general unit demand
buyers in the offline setting, in the case when we have a budget $B$ to bound the cost we incur and
we would like to maximize the buyers' social welfare given this budget constraint. The main tool
is a reduction of our problem to the budgeted maximum coverage problem.

Definition 13.19. An instance of the budgeted maximum coverage problem has a universe $X$
of $m$ elements where each $x_i \in X$ has an associated weight $w_i$; there is a collection of $m$ sets
$\mathcal{S}$ such that each set $S_j \in \mathcal{S}$ has a cost $c_j$; and there is a budget $L$. A feasible solution is a
collection of sets $\mathcal{S}' \subseteq \mathcal{S}$ such that $\sum_{S_j \in \mathcal{S}'} c_j \le L$. The goal is to maximize the weight of the
elements in $\mathcal{S}'$, i.e., $w(\mathcal{S}') = \sum_{x_i \in \bigcup_{S_j \in \mathcal{S}'} S_j} w_i$.

While the budgeted maximum coverage problem is NP-complete, there is a $(1 - 1/e)$ approximation
algorithm [Khuller, Moss, and Naor, 1999]. Their algorithm is a variation of the greedy
algorithm. On the one hand, it computes the greedy allocation, where each time a set which
maximizes the ratio between the weight of the newly covered elements and the cost of the set is added, as
long as the budget constraint is not violated. On the other hand, the single best set is computed.
The output is the better of the two alternatives (either the single best set or the greedy allocation).
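A minimal sketch of the rule described above follows. It implements only the "greedy versus single best set" comparison as stated in this paragraph, and does not reproduce any further steps of the algorithm in the cited paper. It assumes positive set costs and string set names (used only for deterministic tie-breaking); `budgeted_max_coverage` is a hypothetical name.

```python
from typing import Dict, Hashable, List, Set


def budgeted_max_coverage(
    weights: Dict[Hashable, float],
    sets: Dict[str, Set[Hashable]],
    costs: Dict[str, float],
    budget: float,
) -> List[str]:
    """Return the names of the chosen sets."""

    def weight_of(elems) -> float:
        return sum(weights[e] for e in elems)

    # Greedy pass: repeatedly consider the set with the best ratio of newly
    # covered weight to cost; skip it if it would exceed the budget.
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    spent = 0.0
    candidates = set(sets)
    while candidates:
        best = max(
            sorted(candidates),
            key=lambda s: weight_of(sets[s] - covered) / costs[s],
        )
        candidates.remove(best)
        gain = weight_of(sets[best] - covered)
        if gain > 0 and spent + costs[best] <= budget:
            chosen.append(best)
            spent += costs[best]
            covered |= sets[best]

    # Alternative: the single affordable set of largest weight.
    affordable = [s for s in sorted(sets) if costs[s] <= budget]
    single = max(affordable, key=lambda s: weight_of(sets[s]), default=None)
    if single is not None and weight_of(sets[single]) > weight_of(covered):
        return [single]
    return chosen
```

The second branch matters when a cheap, high-ratio set crowds out one expensive set that carries most of the weight.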

Before we show the reduction from a general unit demand utility to the budgeted maximum
coverage problem, we show a simpler case where each buyer $j$ has a value $v_j$ such that for any
item $i$ either $u_{j,i} = v_j$ or $u_{j,i} = 0$, which we call buyer-uniform unit demand.

Lemma 13.20. There is a reduction from the budgeted buyer-uniform unit demand buyers problem
to the budgeted maximum coverage problem. In addition the greedy algorithm can be computed
in polynomial time on the resulting instance.

Proof. For each buyer $j$ we create an element $x_j$ with weight $v_j$. For each item $k$ and any
subset of buyers $S$ we create a set $T_{S,k} = \{x_j : j \in S\}$, which has cost $\mathrm{cost}_k(|S|)$. The budget is
set to be $L = B$. Clearly any feasible solution $T_{S_1,k_1}, \ldots, T_{S_r,k_r}$ of the budgeted maximum coverage
problem can be translated to a solution of the budgeted buyer-uniform unit demand buyers problem
by simply producing item $k_i$ for all the buyers in $T_{S_i,k_i}$. The welfare is the sum of the weights of
the elements covered, which is the social welfare, and the cost is exactly the production cost.

Note that the reduction generates an exponential number of sets, if we do it explicitly. However,
we can run the Greedy algorithm easily, without generating the sets explicitly. Assume
we have $m'$ remaining buyers. For each item $i$ and any $\ell \in [1, m']$ we compute the ratio
$\mathrm{gain}_i(\ell)/\mathrm{cost}_i(\ell)$, where $\mathrm{gain}_i(\ell)$ is the weight of the $\ell$ buyers with highest valuation for item $i$.
Greedy selects the item $i$ and number of buyers $\ell$ which have the highest ratio and for which adding this set
still satisfies the budget constraint. Note that given that Greedy selects $T_{S,k}$ where $|S| = \ell$, then
its cost is $\mathrm{cost}_k(\ell)$ and its weight is $w(T_{S,k}) \le \mathrm{gain}_k(\ell)$, and hence Greedy will always select one
of the sets we are considering. □
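One step of this implicit greedy search can be sketched as below. The sketch assumes the ratio being maximized is gain over cost (consistent with the description of the greedy rule for budgeted maximum coverage), that `values` holds only the buyers not yet covered, and that ties are broken by item and buyer name. The function name and data layout are illustrative.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def best_implicit_set(
    values: Dict[str, float],
    desired: Dict[str, Set[str]],
    cost: Dict[str, Callable[[int], float]],
    budget_left: float,
) -> Optional[Tuple[float, str, List[str]]]:
    """Return (ratio, item, buyers) for the best affordable set T_{S,k}.

    values:  remaining buyer -> v_j.
    desired: buyer -> items for which the buyer's utility equals v_j.
    cost:    item -> production cost as a function of copies made.
    """
    best = None
    for item in sorted(cost):
        # Only the top-valued takers of an item can form its best set of a
        # given size, so sorting once replaces enumerating all subsets.
        takers = sorted(
            (b for b in values if item in desired[b]),
            key=lambda b: (-values[b], b),
        )
        gain = 0.0
        for count, buyer in enumerate(takers, start=1):
            gain += values[buyer]
            c = cost[item](count)
            if c <= budget_left and (best is None or gain / c > best[0]):
                best = (gain / c, item, takers[:count])
    return best
```

Only (number of items) times (number of remaining buyers) candidate sets are examined per step, which is what makes the greedy algorithm polynomial despite the exponential number of sets in the explicit reduction.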

In  the  above  reduction  we  used  very  heavily  the  fact  that  each  buyer  j  has  a  single  valuation 
Vj  regardless  of  which  desired  item  it  gets.  In  the  following  we  show  a  slightly  more  involved 
reduction  which  handles  the  general  unit  demand  buyers. 

Lemma  13.21.  There  is  a  reduction  from  the  budgeted  general  unit  demand  buyers  problem  to 
the  budgeted  maximum  coverage  problem.  In  addition  the  greedy  algorithm  can  be  computed  in 
polynomial  time  on  the  resulting  instance. 


Proof. For each buyer $j$ we sort its valuations $u_{j,i_1} \le \cdots \le u_{j,i_m}$. We set $v_{j,i_1} = u_{j,i_1}$ and
$v_{j,i_r} = u_{j,i_r} - u_{j,i_{r-1}}$. Note that $\sum_{s=1}^{r} v_{j,i_s} = u_{j,i_r}$. For each buyer $j$ we create $m$ elements $x_{j,r}$,
$1 \le r \le m$. For a buyer $j$ and item $k$ let $X_{j,k}$ be all the elements that represent lower valuation
than $u_{j,k}$, i.e., $X_{j,k} = \{x_{j,r} : u_{j,i_r} \le u_{j,k}\}$. For each item $k$ and any subset of buyers $S$ we create
a set $T_{S,k} = \bigcup_{j \in S} X_{j,k}$, which has cost $\mathrm{cost}_k(|S|)$. The budget is set to be $L = B$.
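The incremental weights used in this construction can be illustrated for a single buyer. The sketch below sorts one buyer's utilities, forms the increments $v_{j,i_r}$, and confirms that the elements in $X_{j,k}$ sum to $u_{j,k}$. Function names and the dictionary layout are assumptions for illustration; ties in utility are broken by item name.

```python
from typing import Dict, List, Tuple


def incremental_elements(utilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (item, increment) pairs in increasing order of utility.

    The increment of the r-th ranked item is u_{j,i_r} - u_{j,i_{r-1}},
    with the first increment equal to the smallest utility.
    """
    ranked = sorted(utilities.items(), key=lambda kv: (kv[1], kv[0]))
    out: List[Tuple[str, float]] = []
    prev = 0.0
    for item, u in ranked:
        out.append((item, u - prev))
        prev = u
    return out


def covered_value(utilities: Dict[str, float], item: str) -> float:
    """Total weight of X_{j,k}: elements whose utility is at most u_{j,k}."""
    return sum(
        inc
        for other, inc in incremental_elements(utilities)
        if utilities[other] <= utilities[item]
    )
```

For a buyer with utilities 2, 3 and 5 for three items, the increments are 2, 1 and 2, and covering the elements up to any item recovers exactly that item's utility.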

Any feasible solution $T_{S_1,k_1}, \ldots, T_{S_r,k_r}$ of the budgeted maximum coverage problem can be
translated to a solution of the budgeted general unit demand buyers problem by producing item $k_i$ for all the
buyers in $T_{S_i,k_i}$. We call buyer $j$ a winner if there exists some $b$ such that $x_{j,b} \in \bigcup_{i=1}^{r} T_{S_i,k_i}$. Let
Winners be the set of all winner buyers. For any winner buyer $j \in \mathrm{Winners}$ let $\mathrm{item}(j) = s$
such that $s = \max\{b : x_{j,b} \in \bigcup_{i=1}^{r} T_{S_i,k_i}\}$.

The cost of our allocation is by definition at most $L = B$. The social welfare is

$$\sum_{x_{j,b} \in \bigcup_{i=1}^{r} T_{S_i,k_i}} v_{j,b} = \sum_{j \in \mathrm{Winners}} u_{j,\mathrm{item}(j)}.$$


Again, note that the reduction generates an exponential number of sets, if we do it explicitly.
However, we can run the Greedy algorithm easily, without generating the sets explicitly. For
each item $i$ and any $\ell \in [1, m]$ we compute the ratio $\mathrm{gain}_i(\ell)/\mathrm{cost}_i(\ell)$, where $\mathrm{gain}_i(\ell)$ is the
weight of the $\ell$ buyers with highest valuation for item $i$. Greedy selects the item $i$ and number
of buyers $\ell$ which have the highest ratio and still satisfy the budget constraint. Note that
given that Greedy selects $T_{S,k}$ where $|S| = \ell$, then its production cost is $\mathrm{cost}_k(\ell)$ and its weight
is $w(T_{S,k}) \le \mathrm{gain}_k(\ell)$, and hence Greedy will always select one of the sets we are considering.
Once Greedy selects a set $T_{S,k}$ we need to update the utility of any buyer $j \in S$ for any
other item $i$, by setting $u_{j,i} = \max\{u_{j,i} - u_{j,k}, 0\}$, which is the residual valuation buyer $j$ has for
getting item $i$ in addition to item $k$. □

Combining our reduction with the approximation algorithm of [Khuller, Moss, and Naor, 1999]
we have the following theorem.

Theorem 13.22. There exists a poly-time algorithm for the budgeted general unit demand buyers
problem which achieves social welfare at least $(1 - 1/e)\mathrm{OPT}$.

13.5.1  Generalization 

To  extend  these  results  to  the  online  setting,  we  will  use  Theorem  13.3  to  represent  allocations 
by  pricing  policies,  and  then  use  the  results  from  above  to  learn  a  good  pricing  policy  based  on 
an  initial  sample. 

Theorem 13.23. Suppose every $u_{j,i} \in [0, B]$. With $\ell = O((1/\epsilon^2)(r^3 \log(rB/\epsilon) + \log(1/\delta)))$
random samples, with probability at least $1 - \delta$, the empirical per-customer social welfare is
within $\pm\epsilon$ of the expected per-customer social welfare, uniformly over all price vectors in $[0, B]^r$.

Proof. We will show that, for any distribution $P$ and value $\epsilon > 0$, there exist $N = 2^{O(r^3 \log(rB/\epsilon))}$
functions $f_1, \ldots, f_N$ such that, for every price vector $\mathrm{price} \in [0, B]^r$, the function
$g(x) = x_{\arg\max_{i \le r} (x_i - \mathrm{price}_i)}$ has $\min_{k \le N} \int |f_k - g|\,dP \le \epsilon$. This value $N$ is known as the uniform
$\epsilon$-covering number. The result then follows from standard uniform convergence bounds (see e.g.,
[Haussler, 1992]).

For each $i \le r$, the function $x \mapsto x_i - \mathrm{price}_i$ is a hyperplane with slope 1 in coordinate $i$ and slope
0 in all other coordinates. So the subgraph of $x \mapsto \max_{i \le r} (x_i - \mathrm{price}_i)$ (i.e., the set of
$(r+1)$-dimensional points $(x, y)$ for which $\max_{i \le r} (x_i - \mathrm{price}_i) \ge y$) is a union of $r$ halfspaces
in $r+1$ dimensions. The space of
unions of $r$ halfspaces in $r+1$ dimensions has VC dimension $r(r+2)$, so this upper bounds the
pseudo-dimension of the space of functions $\max_{i \le r} (x_i - \mathrm{price}_i)$, parametrized by the price vector
price. Therefore, the uniform $\epsilon$-covering number of this class is $2^{O(r^2 \log(B/\epsilon))}$.

For each $i \le r$, the set of vectors $x \in [0, B]^r$ such that $i = \arg\max_k (x_k - \mathrm{price}_k)$ is an
intersection of $r$ halfspaces in $r$ dimensions. Thus, the function $x \mapsto \mathrm{price}_{\arg\max_i (x_i - \mathrm{price}_i)}$ is
contained in the family of linear combinations of $r$ disjoint intersections of $r$ halfspaces. The
VC dimension of an intersection of $r$ halfspaces in $r$ dimensions is $r(r + 1)$. So assuming the
prices are bounded in a range $[0, B]$, the uniform $\epsilon$-covering number for linear combinations (with
weights in $[0, B]$) of $r$ disjoint intersections of $r$ halfspaces is $2^{O(r^3 \log(rB/\epsilon))}$. To prove this, we
can take an $\epsilon/(2rB)$ cover (of $\{0, 1\}$-valued functions) of intersections of $r$ halfspaces, which
has size $(rB/\epsilon)^{O(r^2)}$, and then take an $\epsilon/(2r)$ grid in $[0, B]$ and multiply each function in the
cover by each of these values to get a space of real-valued functions; there are $(rB/\epsilon)^{O(r^2)}$ total
functions in this cover, and for each term in the linear combination of $r$ disjoint intersections of
$r$ halfspaces, at least one of these real-valued functions will be within $\epsilon/r$ of it. Thus, taking the
set of sums of $r$ functions from this cover forms an $\epsilon$-cover of the space of linear combinations
of $r$ disjoint intersections of $r$ halfspaces, with size $(rB/\epsilon)^{O(r^3)}$.
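
The error accounting behind this construction can be written out as follows. This is a sketch; the symbols $h$, $h'$, $w$, $w'$ are introduced here for illustration and do not appear in the original: $h$ is the indicator of one intersection of $r$ halfspaces, $h'$ is an element of the $\epsilon/(2rB)$ cover closest to it, $w \in [0, B]$ is its weight, and $w'$ is a grid point with $|w - w'| \le \epsilon/(2r)$.

```latex
% Per-term error: triangle inequality, then the two approximation guarantees.
\begin{align*}
\int |w h - w' h'| \, dP
  &\le w \int |h - h'| \, dP + |w - w'| \int h' \, dP \\
  &\le B \cdot \frac{\epsilon}{2rB} + \frac{\epsilon}{2r}
   = \frac{\epsilon}{r}.
\end{align*}
% Summing over the r terms gives total error at most \epsilon.
% Counting r-tuples of functions from the real-valued cover:
\[
  \left( (rB/\epsilon)^{O(r^2)} \right)^{r}
  = (rB/\epsilon)^{O(r^3)}
  = 2^{O(r^3 \log(rB/\epsilon))}.
\]
```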

Now note that $x_{\arg\max_i (x_i - \mathrm{price}_i)} = \max_i (x_i - \mathrm{price}_i) + \mathrm{price}_{\arg\max_i (x_i - \mathrm{price}_i)}$. So the uniform
$\epsilon$-covering number for the space of possible functions $x \mapsto x_{\arg\max_i (x_i - \mathrm{price}_i)}$ is at most the product
of the uniform $(\epsilon/2)$-covering number for the space of functions $x \mapsto \max_i (x_i - \mathrm{price}_i)$ and
the uniform $(\epsilon/2)$-covering number for the space of functions $x \mapsto \mathrm{price}_{\arg\max_i (x_i - \mathrm{price}_i)}$; by the
above, this product is $2^{O(r^3 \log(rB/\epsilon))}$. □
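
The decomposition used in this last step can be checked numerically. The sketch below draws random valuation and price vectors (the uniform distribution and the name `decomposition_gap` are illustrative assumptions) and measures how far $x_{i^*}$ is from $\max_i (x_i - \mathrm{price}_i) + \mathrm{price}_{i^*}$, where $i^*$ is the maximizing index.

```python
import random

def decomposition_gap(x, price):
    """|x[i*] - (max_i (x_i - price_i) + price[i*])| for the maximizing index i*."""
    utilities = [xi - pi for xi, pi in zip(x, price)]
    i_star = max(range(len(x)), key=lambda i: utilities[i])
    lhs = x[i_star]
    rhs = max(utilities) + price[i_star]
    return abs(lhs - rhs)

rng = random.Random(1)
r, B = 4, 10.0
max_gap = max(
    decomposition_gap(
        [rng.uniform(0, B) for _ in range(r)],
        [rng.uniform(0, B) for _ in range(r)],
    )
    for _ in range(1000)
)
```

Up to floating-point rounding the gap is zero, which is why a cover for each of the two summands at scale $\epsilon/2$ yields a cover for the sum at scale $\epsilon$.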


13.6  Properties of $\beta$-nice cost

Let cost(n) be a $\beta$-nice cost function. We show a few properties of it.
Claim 13.24.

$$\mathrm{cost}(2n) \ge \mathrm{cost}(n) \left(1 + \frac{1}{2\beta}\right)$$

Proof. Let $a = \mathrm{cost}(n)/n$ be the average cost of the first $n$ items. Then the cost of the first $2n$
items is at least $an$, so their average cost is at least $a/2$. The marginal cost of item $2n$ is therefore at
least $a/(2\beta)$. Therefore the cost of the items $n + 1$ to $2n$ is at least $an/(2\beta)$, and so
$\mathrm{cost}(2n) \ge an + an/(2\beta) = \mathrm{cost}(n)(1 + 1/(2\beta))$. □
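
The doubling bound can be sanity-checked on a concrete cost function. The sketch below uses the hypothetical choice cost(n) = sqrt(n) with $\beta = 3$; both are assumptions made for illustration only. It first checks the property the proofs in this section rely on (the marginal cost of item $n + 1$ is at least $1/\beta$ times the average cost of the first $n$ items), then checks the inequality of Claim 13.24.

```python
import math

BETA = 3.0  # hypothetical value chosen for this example

def cost(n):
    """Hypothetical cost function for illustration: cost(n) = sqrt(n)."""
    return math.sqrt(n)

def marginal_condition_holds(n, beta=BETA):
    """Marginal cost of item n + 1 is at least (cost(n) / n) / beta."""
    return cost(n + 1) - cost(n) >= (cost(n) / n) / beta

def doubling_bound_holds(n, beta=BETA):
    """cost(2n) >= cost(n) * (1 + 1 / (2 * beta))."""
    return cost(2 * n) >= cost(n) * (1 + 1 / (2 * beta))

ns = range(1, 1001)
assumption_ok = all(marginal_condition_holds(n) for n in ns)
claim_ok = all(doubling_bound_holds(n) for n in ns)
```

For this cost function the bound is far from tight: cost(2n)/cost(n) is $\sqrt{2} \approx 1.41$, while the claim only requires $1 + 1/6 \approx 1.17$.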

We can get a better bound by a more refined analysis.

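Claim 13.25. Let $a_n = \mathrm{cost}(n)/n$ be the average cost of the first $n$ items. Then,

$$a_{n+1} \ge a_n \left(1 - \frac{1}{n+1}\right) \left(1 + \frac{1}{\beta(n+1)}\right)$$

and

$$a_n \ge a_1 \frac{1}{n} \prod_{t=1}^{n-1} \left(1 + \frac{1}{\beta(t+1)}\right) \ge e^{-1/\beta^2} \cdot a_1 n^{-1+1/\beta}.$$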

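Proof. The marginal cost of item $n + 1$ is at least $a_n/\beta$. Therefore the cost of the first $n + 1$ items
is at least $n a_n + a_n/\beta$. Dividing by $n + 1$ gives $a_{n+1} \ge a_n (n + 1/\beta)/(n + 1) =
a_n (1 - 1/(n+1))(1 + 1/(\beta n))$, which gives the first expression since $1/(\beta n) \ge 1/(\beta(n+1))$.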

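We get the expression of $a_n$ as a function of $a_1$ by repeatedly using the recursion. The
approximation follows from

$$\begin{aligned}
\ln(a_n) &\ge \ln(a_1) - \ln(n) + \sum_{t=1}^{n-1} \ln\left(1 + \frac{1}{\beta(t+1)}\right) \\
&\ge \ln(a_1) - \ln(n) + \sum_{t=1}^{n-1} \left(\frac{1}{\beta(t+1)} - \frac{1}{(\beta(t+1))^2}\right) \\
&\ge \ln(a_1) - \ln(n) + \frac{1}{\beta} \ln(n) - \frac{1}{\beta^2},
\end{aligned}$$

where we used the inequality $x - x^2 \le \ln(1 + x)$, valid for $x \ge 0$.

□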
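
A short numerical check of the two bounds in Claim 13.25, under the same illustrative assumptions as before (the hypothetical cost function cost(n) = sqrt(n) and $\beta = 3$, neither of which comes from the original). The sketch tests the one-step recursion $a_{n+1} \ge a_n (1 - 1/(n+1))(1 + 1/(\beta(n+1)))$ and a closed-form lower bound of the form $e^{-1/\beta^2} a_1 n^{-1+1/\beta}$.

```python
import math

BETA = 3.0  # hypothetical value chosen for this example

def avg_cost(n):
    """Average cost a_n = cost(n) / n for the hypothetical cost(n) = sqrt(n)."""
    return math.sqrt(n) / n

def recursion_holds(n, beta=BETA):
    """a_{n+1} >= a_n * (1 - 1/(n+1)) * (1 + 1/(beta * (n+1)))."""
    rhs = avg_cost(n) * (1 - 1 / (n + 1)) * (1 + 1 / (beta * (n + 1)))
    return avg_cost(n + 1) >= rhs

def closed_form_lower_bound(n, beta=BETA):
    """e^{-1/beta^2} * a_1 * n^{-1 + 1/beta}."""
    return math.exp(-1 / beta**2) * avg_cost(1) * n ** (-1 + 1 / beta)

ns = range(1, 1001)
recursion_ok = all(recursion_holds(n) for n in ns)
closed_form_ok = all(avg_cost(n) >= closed_form_lower_bound(n) for n in ns)
```

Here $a_n = n^{-1/2}$ decays more slowly than the lower bound's $n^{-2/3}$, so both checks pass with room to spare; a check on one cost function illustrates the claim but does not prove it.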